Search Results (135)

Search Parameters:
Keywords = lexical complexity

17 pages, 1173 KB  
Article
DiCo-EXT: Diversity and Consistency-Guided Framework for Extractive Summarization
by Yiming Wang and Jindong Zhang
Entropy 2026, 28(1), 88; https://doi.org/10.3390/e28010088 - 12 Jan 2026
Viewed by 170
Abstract
ROUGE is a common objective for extractive summarization because n-gram overlap aligns with sentence-level selection. However, models that focus only on ROUGE often choose sentences with similar content, and the resulting summaries contain redundant information. We propose DiCo-EXT, a training framework that integrates two new loss terms into a standard extractive model: a semantic consistency term and a diversity penalty. The consistency module encourages selected sentences to stay close to document-level meaning, and the diversity penalty reduces semantic overlap within the summary. Both components are fully differentiable and can be optimized together with the base loss, without extra heuristics or multi-stage post-processing. Experiments on CNN/DailyMail, XSum, and WikiHow show lower redundancy and higher lexical diversity, while ROUGE remains comparable to a strong baseline. These results indicate that simple training objectives can balance coverage and redundancy without increasing model size or architectural complexity. Full article

12 pages, 1115 KB  
Communication
Linguistic Influence on Multidimensional Word Embeddings: Analysis of Ten Languages
by Anna V. Aleshina, Andrey L. Bulgakov, Yanliang Xin and Larisa S. Skrebkova
Computation 2026, 14(1), 16; https://doi.org/10.3390/computation14010016 - 9 Jan 2026
Viewed by 190
Abstract
Understanding how linguistic typology shapes multilingual embeddings is important for cross-lingual NLP. We examine static MUSE word embeddings for ten diverse languages (English, Russian, Chinese, Arabic, Indonesian, German, Lithuanian, Hindi, Tajik and Persian). Using pairwise cosine distances, Random Forest classification, and UMAP visualization, we find that language identity and script type largely determine embedding clusters, with morphological complexity affecting cluster compactness and lexical overlap connecting clusters. The Random Forest model predicts language labels with high accuracy (≈98%), indicating strong language-specific patterns in embedding space. These results highlight script, morphology, and lexicon as key factors influencing multilingual embedding structures, informing linguistically aware design of cross-lingual models. Full article

22 pages, 933 KB  
Article
An Entity Relationship Extraction Method Based on Multi-Mechanism Fusion and Dynamic Adaptive Networks
by Xiantao Jiang, Xin Hu and Bowen Zhou
Information 2026, 17(1), 38; https://doi.org/10.3390/info17010038 - 3 Jan 2026
Viewed by 321
Abstract
This study introduces a multi-mechanism entity–relation extraction model designed to address persistent challenges in natural language processing, including syntactic complexity, long-range dependency modeling, and suboptimal utilization of contextual information. The proposed architecture integrates several complementary components. First, a pre-trained Chinese-RoBERTa-wwm-ext encoder with a whole-word masking strategy is employed to preserve lexical semantics and enhance contextual representations for multi-character Chinese text. Second, BiLSTM-based sequential modeling is incorporated to capture bidirectional contextual dependencies, facilitating the identification of distant entity relations. Third, the combination of multi-head attention and gated attention mechanisms enables the model to selectively emphasize salient semantic cues while suppressing irrelevant information. To further improve global prediction consistency, a Conditional Random Field (CRF) layer is applied at the output stage. Building upon this multi-mechanism framework, an adaptive dynamic network is introduced to enable input-dependent activation of feature modeling modules based on sentence-level semantic complexity. Rather than enforcing a fixed computation pipeline, the proposed mechanism supports flexible and context-aware feature interaction, allowing the model to better accommodate heterogeneous sentence structures. Experimental results on benchmark datasets demonstrate that the proposed approach achieves strong extraction performance and improved robustness, making it a flexible solution for downstream applications such as knowledge graph construction and semantic information retrieval. Full article

18 pages, 282 KB  
Article
Not Strictly a Woman—QUD-Based Four-Valent Reasoning Discharges Lexical Meaning
by Emil Eva Rosina and Franci Mangraviti
Logics 2025, 3(4), 16; https://doi.org/10.3390/logics3040016 - 11 Dec 2025
Viewed by 328
Abstract
We offer a framework that captures both context-dependency and vagueness of predicate meanings—illustrated by the politically relevant case of woman—as an interaction of lexical meaning and Question under Discussion (‘QUD’). We extend existing approaches to non-maximality to superficially polysemous predicates like woman and show that this is conceptually coherent and insightful for a linguistic analysis of political debates about gender invitation policies: while there are (i) clear, semantically true, and (ii) strictly unacceptable cases of x is a woman, there are also (iii) merely pragmatically acceptable cases (‘like a woman with respect to the QUD’) as well as (iv) truly vague ones. We argue that this four-way division is the least complex model that captures current gender discourses in a harm-reducing, trans-inclusive way. This offers a new perspective on the semantics–pragmatics interface of predicate meanings. Full article
(This article belongs to the Special Issue Logic, Language, and Information)
9 pages, 437 KB  
Article
Readability Optimization of Layperson Summaries in Urological Oncology Clinical Trials: Outcomes from the BRIDGE-AI 8 Study
by Ilicia Cano, Aalamnoor Pannu, Ethan Layne, Conner Ganjavi, Aditya Desai, Gus Miranda, Jie Cai, Vasileios Magoulianitis, Karan Gill, Gerhard Fuchs, Mihir Desai, Inderbir Gill and Giovanni E. Cacciamani
Curr. Oncol. 2025, 32(12), 696; https://doi.org/10.3390/curroncol32120696 - 10 Dec 2025
Viewed by 541
Abstract
Accessible health information is essential to promote patient engagement and informed participation in clinical research. Brief summaries on ClinicalTrials.gov are intended for laypeople; however, they are often written at a reading level that is too advanced for the public. This study evaluated the performance of a Generative Artificial Intelligence (GAI)-powered tool—Pub2Post—in producing readable and complete layperson brief summaries for urologic oncology clinical trials. Twenty actively recruiting clinical trials on prostate, bladder, kidney, and testis cancers were retrieved from ClinicalTrials.gov. For each, a GAI-generated summary was produced and compared with its publicly available counterpart. Readability indices, grade-level indicators, and text metrics were analyzed alongside content inclusion across eight structural domains. GAI-generated summaries demonstrated markedly improved readability (mean FRES 73.3 ± 3.5 vs. 17.0 ± 13.1; p < 0.0001), aligning with the recommended middle-school reading level, and achieved 100% inclusion of guideline-defined content elements. GAI summaries exhibited simpler syntax and reduced lexical complexity, supporting improved comprehension. These findings suggest that GAI tools such as Pub2Post can generate patient-facing summaries that are both accessible and comprehensive. Full article
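The Flesch Reading Ease Score (FRES) reported in this abstract is a standard readability formula over sentence length and syllables per word. As a rough illustration of how such a score is computed, here is a minimal sketch; the syllable counter is a naive vowel-group heuristic for demonstration only, not the validated counter a readability tool would use:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Higher scores mean easier text, which is why the GAI summaries' mean of 73.3 (roughly middle-school level) contrasts with the originals' 17.0.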

26 pages, 1005 KB  
Article
A Context-Aware Lightweight Framework for Source Code Vulnerability Detection
by Yousef Sanjalawe, Budoor Allehyani and Salam Al-E’mari
Future Internet 2025, 17(12), 557; https://doi.org/10.3390/fi17120557 - 3 Dec 2025
Viewed by 580
Abstract
As software systems grow increasingly complex and interconnected, detecting vulnerabilities in source code has become a critical and challenging task. Traditional static analysis methods often fall short in capturing deep, context-dependent vulnerabilities and adapting to rapidly evolving threat landscapes. Recent efforts have explored knowledge graphs and transformer-based models to enhance semantic understanding; however, these solutions frequently rely on static knowledge bases, exhibit high computational overhead, and lack adaptability to emerging threats. To address these limitations, we propose DynaKG-NER++, a novel and lightweight framework for context-aware vulnerability detection in source code. Our approach integrates lexical, syntactic, and semantic features using a transformer-based token encoder, dynamic knowledge graph embeddings, and a Graph Attention Network (GAT). We further introduce contrastive learning on vulnerability–patch pairs to improve discriminative capacity and design an attention-based fusion module to combine token and entity representations adaptively. A key innovation of our method is the dynamic construction and continual update of the knowledge graph, allowing the model to incorporate newly published CVEs and evolving relationships without retraining. We evaluate DynaKG-NER++ on five benchmark datasets, demonstrating superior performance across span-level F1 (89.3%), token-level accuracy (93.2%), and AUC-ROC (0.936), while achieving the lowest false positive rate (5.1%) among state-of-the-art baselines. Statistical significance tests confirm that these improvements are robust and meaningful. Overall, DynaKG-NER++ establishes a new standard in vulnerability detection, balancing accuracy, adaptability, and efficiency, making it highly suitable for deployment in real-world static analysis pipelines and resource-constrained environments. Full article
(This article belongs to the Topic Addressing Security Issues Related to Modern Software)

14 pages, 1691 KB  
Article
Phonological Neighborhood Density and Type Modulate Visual Recognition of Mandarin Chinese: Evidence from Monosyllabic Words
by Zhongyan Jiao, Xianhui Zhou and Wenjun Chen
Brain Sci. 2025, 15(12), 1304; https://doi.org/10.3390/brainsci15121304 - 2 Dec 2025
Viewed by 404
Abstract
Background: Examining the influence of phonological neighborhoods on the early stages of visual word recognition provides insights into the architecture and dynamics of lexical representation and processing. Methods: Using event-related potentials (ERPs), this investigation explored how phonological neighborhood density (PND; large vs. small) and type (PNT; tone-edit vs. constituent-edit neighbors) influence the recognition of monosyllabic words in Mandarin Chinese. Participants engaged in a priming paradigm combined with a visual lexical decision task. Results: Behavioral data demonstrated the main effect of PNT: words with tone-edit neighbors produced greater processing inhibition compared to those with constituent-edit neighbors. ERP results revealed that large PND enhanced the P200 amplitude, a frontal-mediated effect that was particularly pronounced for tone-edit neighbors. This early differentiation subsequently propelled a stronger N400 response to tone-edit neighbors, culminating in a significant interaction between PND and PNT during the N400 window. Conclusions: These findings support a cascaded competition model: early PND assessment (P200), enhanced for tone neighbors, amplifies their later N400 conflict. This neural mechanism elucidates the hierarchical organization of phonological processing in Chinese monosyllabic words, thereby clarifying a core component which underpins the recognition of more complex words in Mandarin. Full article
(This article belongs to the Section Neurolinguistics)

20 pages, 724 KB  
Article
A Lightweight Multimodal Framework for Misleading News Classification Using Linguistic and Behavioral Biometrics
by Mahmudul Haque, A. S. M. Hossain Bari and Marina L. Gavrilova
J. Cybersecur. Priv. 2025, 5(4), 104; https://doi.org/10.3390/jcp5040104 - 25 Nov 2025
Viewed by 671
Abstract
The widespread dissemination of misleading news presents serious challenges to public discourse, democratic institutions, and societal trust. Misleading-news classification (MNC) has been extensively studied through deep neural models that rely mainly on semantic understanding or large-scale pretrained language models. However, these methods often lack interpretability and are computationally expensive, limiting their practical use in real-time or resource-constrained environments. Existing approaches can be broadly categorized into transformer-based text encoders, hybrid CNN–LSTM frameworks, and fuzzy-logic fusion networks. To advance research on MNC, this study presents a lightweight multimodal framework that extends the Fuzzy Deep Hybrid Network (FDHN) paradigm by introducing a linguistic and behavioral biometric perspective to MNC. We reinterpret the FDHN architecture to incorporate linguistic cues such as lexical diversity, subjectivity, and contradiction scores as behavioral signatures of deception. These features are processed and fused with semantic embeddings, resulting in a model that captures both what is written and how it is written. The design of the proposed method balances feature complexity and model generalizability. Experimental results demonstrate that the inclusion of lightweight linguistic and behavioral biometric features significantly enhances model performance, yielding a test accuracy of 71.91 ± 0.23% and a macro F1 score of 71.17 ± 0.26%, outperforming the state-of-the-art method. The findings of the study underscore the utility of stylistic and affective cues in MNC while highlighting the need for model simplicity to maintain robustness and adaptability. Full article
(This article belongs to the Special Issue Multimedia Security and Privacy)

27 pages, 1140 KB  
Article
Flattening the Developmental Staircase: Lexical Complexity Progression in Elementary Reading Texts Across Six Decades
by Elfrieda H. Hiebert
Educ. Sci. 2025, 15(11), 1546; https://doi.org/10.3390/educsci15111546 - 17 Nov 2025
Viewed by 739
Abstract
This study examined lexical complexity patterns in elementary reading textbooks across four pivotal decades (1957, 1974, 1995, 2014) to understand how educational reforms have influenced developmental progressions in reading materials. The study analyzed a corpus of 320,000 words from one continuously published core reading program across grades 1–4 for four copyright years. The corpus consisted of a 20,000-word sample for each grade and year, analyzed for type-token ratio, percentage of complex words, and percentage of single-appearing words. Results revealed three major shifts: (a) systematic within-grade complexity increases in earlier programs (1957, 1974) were replaced by flat progression in later programs (1995, 2014), (b) steep across-grade differentiation collapsed with grade-to-grade increases in lexical diversity declining from greater than 100% to under 10%, and (c) first-grade expectations accelerated dramatically, whereas third- and fourth-grade texts remained remarkably stable across all six decades. By 2014, first graders encountered lexical complexity levels that characterized fourth-grade texts in 1957. These findings challenge narratives of declining text complexity and reveal that contemporary elementary readers experience compressed developmental progressions with elevated starting points but minimal growth trajectories. The implications suggest the need for reconceptualizing text design to balance appropriate challenges with systematic scaffolding, particularly for students dependent on school-based literacy instruction. Full article
(This article belongs to the Special Issue Advances in Evidence-Based Literacy Instructional Practices)
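Two of the measures this study uses, type-token ratio and the percentage of single-appearing words, are simple corpus statistics. A minimal sketch follows; the function name is illustrative, and the single-appearing-word percentage is computed here over word types (hapax legomena), which may differ from the study's exact operationalization:

```python
from collections import Counter
import re

def lexical_metrics(text: str) -> dict:
    # Tokenize to lowercase word forms.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {
        # Type-token ratio: distinct word forms / total word forms.
        "type_token_ratio": len(counts) / len(tokens),
        # Share of types that occur exactly once in the sample.
        "pct_single_appearing": 100 * sum(1 for c in counts.values() if c == 1) / len(counts),
    }
```

Because type-token ratio falls as sample size grows, comparisons like those in the study depend on fixed-size samples (here, 20,000 words per grade and year).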

18 pages, 3645 KB  
Systematic Review
Screening of the Impact of Dual Training in the Spanish University Press: A Documentary Review
by Jesica-María Abalo Paulos, Olalla García-Fuentes, Manuela Raposo-Rivas and M. Carmen Sarceda-Gorgoso
Journal. Media 2025, 6(4), 191; https://doi.org/10.3390/journalmedia6040191 - 14 Nov 2025
Viewed by 680
Abstract
University Dual Training is constructed at the intersection of academic and professional spheres, shaping a complex and multifaceted educational model. The aim of this study is to analyze the media representation of University Dual Training within the Spanish higher education landscape. The analysis focused on news articles published in the digital press of Spanish universities between 2021 and 2025. Following the methodological principles of a systematic review, a total of 81 news items (comprising 747 lexical segments) were identified and categorized: 60 from 25 public universities and 21 from 7 private institutions. Data analysis, supported by the MAXQDA 24 software, enabled the identification of trends in the use of keywords, temporal evolution, and prevailing themes, along with the degree of relevance attributed to this training modality. The findings reveal an institutional tendency in media dissemination centred on promoting University Dual Training as a pathway for educational innovation, highlighting experiences and collaborations with companies, and projecting a discourse in which universities present themselves as committed to this modality. The study concludes that digital university newspapers convey the relevance and impact of University Dual Training as a modality that brings together diverse stakeholders, creating a space of collaboration and shared responsibility that strengthens student training and employability. Full article

28 pages, 20548 KB  
Article
KGGCN: A Unified Knowledge Graph-Enhanced Graph Convolutional Network Framework for Chinese Named Entity Recognition
by Xin Chen, Liang He, Weiwei Hu and Sheng Yi
AI 2025, 6(11), 290; https://doi.org/10.3390/ai6110290 - 13 Nov 2025
Viewed by 881
Abstract
Recent advances in Chinese Named Entity Recognition (CNER) have integrated lexical features and factual knowledge into pretrained language models. However, existing lexicon-based methods often inject knowledge as restricted, isolated token-level information, lacking rich semantic and structural context. Knowledge graphs (KGs), comprising relational triples, offer explicit relational semantics and reasoning capabilities, while Graph Convolutional Networks (GCNs) effectively capture complex sentence structures. We propose KGGCN, a unified KG-enhanced GCN framework for CNER. KGGCN introduces external factual knowledge without disrupting the original word order, employing a novel end-append serialization scheme and a visibility matrix to control interaction scope. The model further utilizes a two-phase GCN stack, combining a standard GCN for robust aggregation with a multi-head attention GCN for adaptive structural refinement, to capture multi-level structural information. Experiments on four Chinese benchmark datasets demonstrate KGGCN’s superior performance. It achieves the highest F1-scores on MSRA (95.96%) and Weibo (71.98%), surpassing previous bests by 0.26 and 1.18 percentage points, respectively. Additionally, KGGCN obtains the highest Recall on OntoNotes (84.28%) and MSRA (96.14%), and the highest Precision on MSRA (95.82%), Resume (96.40%), and Weibo (72.14%). These results highlight KGGCN’s effectiveness in leveraging structured knowledge and multi-phase graph modeling to enhance entity recognition accuracy and coverage across diverse Chinese texts. Full article

19 pages, 1791 KB  
Article
Document Encoding Effects on Large Language Model Response Time and Consistency
by Dianeliz Ortiz Martes and Nezamoddin N. Kachouie
Computers 2025, 14(11), 493; https://doi.org/10.3390/computers14110493 - 13 Nov 2025
Viewed by 714
Abstract
Large language models (LLMs) such as GPT-4 are increasingly integrated into research, industry, and enterprise workflows, yet little is known about how input file formats shape their outputs. While prior work has shown that formats can influence response time, the effects on readability, complexity, and semantic stability remain underexplored. This study systematically evaluates GPT-4’s responses to 100 queries drawn from 50 academic papers, each tested across four formats, TXT, DOCX, PDF, and XML, yielding 400 question–answer pairs. We have assessed two aspects of the responses to the queries: first, efficiency quantified by response time and answer length, and second, linguistic style measured by readability indices, sentence length, word length, and lexical diversity where semantic similarity was considered to control for preservation of semantic context. Results show that readability and semantic content remain stable across formats, with no significant differences in Flesch–Kincaid or Dale–Chall scores, but response time is sensitive to document encoding, with XML consistently outperforming PDF, DOCX, and TXT in the initial experiments conducted in February 2025. Verbosity, rather than input size, emerged as the main driver of latency. However, follow-up replications conducted several months later (October 2025) under the updated Microsoft Copilot Studio (GPT-4) environment showed that these latency differences had largely converged, indicating that backend improvements, particularly in GPT-4o’s document-ingestion and parsing pipelines, have reduced the earlier disparities. These findings suggest that the file format matters and affects how fast the LLMs respond, although its influence may diminish as enterprise-level AI systems continue to evolve. Overall, the content and semantics of the responses are fairly similar and consistent across different file formats, demonstrating that LLMs can handle diverse encodings without compromising response quality. For large-scale applications, adopting structured formats such as XML or semantically tagged HTML can still yield measurable throughput gains in earlier system versions, whereas in more optimized environments, such differences may become minimal. Full article

21 pages, 979 KB  
Article
How the Stakeholders’ Perception Contributes to the Pharmaceutical Strategies: A Regional Case Study in Latin America
by Talita da Silva Ferreira, Giovanni M. Pauletti and Luis Vázquez-Suárez
J. Mark. Access Health Policy 2025, 13(4), 54; https://doi.org/10.3390/jmahp13040054 - 23 Oct 2025
Viewed by 808
Abstract
Background: Stakeholders’ perception plays a crucial role in shaping pharmaceutical strategies. Stakeholders are groups interested in pharmaceutical companies’ success and outcomes. Stakeholders’ perceptions are multifaceted and impact pharmaceutical strategies, from shaping research to enhancing market access, pricing, and corporate reputation. Understanding and actively managing stakeholders’ perceptions is vital for pharmaceutical companies to succeed in an increasingly complex and competitive industry. Methods: In this case study, knowledge contributions from stakeholders offered insights and strategies for application in the pharmaceutical sector. Results: Qualitative, exploratory research was conducted with the participation of sixteen stakeholders from different countries in Latin America, who responded to a semi-structured interview script; the resulting data were analyzed through lexical analysis in the Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (IRaMuTeQ). Conclusions: The results of this study underscore the importance of regulatory knowledge for professionals’ support and implementation of international strategies. Regulatory knowledge provides professionals with tools and insights to navigate complex regulatory environments, make informed decisions, and enhance organizational performance in global markets. Full article

8 pages, 218 KB  
Proceeding Paper
Towards an Explainable Linguistic Approach to the Identification of Expressive Forms Within Arabic Text
by Zouheir Banou, Sanaa El Filali, El Habib Benlahmar, Fatima-Zahra Alaoui and Laila El Jiani
Eng. Proc. 2025, 112(1), 26; https://doi.org/10.3390/engproc2025112026 - 15 Oct 2025
Viewed by 484
Abstract
This paper presents a rule-based negation and litotes detection system for Modern Standard Arabic. Unlike purely statistical approaches, the proposed pipeline leverages linguistic structures, lexical resources, and dependency parsing to identify negated expressions, exception clauses, and instances of litotic inversion, where rhetorical negation conveys an implicit positive meaning. The system was applied to a large-scale subset of the Arabic OSCAR corpus, filtered by sentence length and syntactic structure. The results show the successful detection of 5193 negated expressions and 1953 litotic expressions through antonym matching. Additionally, 200 instances involving exception prepositions were identified, reflecting their syntactic specificity and rarity in Arabic. The system is fully interpretable, reproducible, and well-suited to low-resource environments where machine learning approaches may not be viable. Its ability to scale across heterogeneous data while preserving linguistic sensitivity demonstrates the relevance of rule-based systems for morphologically rich and structurally complex languages. This work contributes a practical framework for analyzing negation phenomena and offers insight into rhetorical inversion in Arabic discourse. Although coverage of rarer structures is limited, the pipeline provides a solid foundation for future refinement and domain-specific applications in figurative language processing. Full article

21 pages, 1197 KB  
Article
A Hybrid System for Automated Assessment of Korean L2 Writing: Integrating Linguistic Features with LLM
by Wonjin Hur and Bongjun Ji
Systems 2025, 13(10), 851; https://doi.org/10.3390/systems13100851 - 28 Sep 2025
Viewed by 1777
Abstract
The global expansion of Korean language education has created an urgent need for scalable, objective, and consistent methods for assessing the writing skills of non-native (L2) learners. Traditional manual grading is resource-intensive and prone to subjectivity, while existing Automated Essay Scoring (AES) systems [...] Read more.
The global expansion of Korean language education has created an urgent need for scalable, objective, and consistent methods for assessing the writing skills of non-native (L2) learners. Traditional manual grading is resource-intensive and prone to subjectivity, while existing Automated Essay Scoring (AES) systems often struggle with the linguistic nuances of Korean and the specific error patterns of L2 writers. This paper introduces a novel hybrid AES system designed specifically for Korean L2 writing. The system integrates two complementary feature sets: (1) a comprehensive suite of conventional linguistic features capturing lexical diversity, syntactic complexity, and readability to assess writing form and (2) a novel semantic relevance feature that evaluates writing content. This semantic feature is derived by calculating the cosine similarity between a student’s essay and an ideal, high-proficiency reference answer generated by a Large Language Model (LLM). Various machine learning models are trained on the Korean Language Learner Corpus from the National Institute of the Korean Language to predict a holistic score on the 6-level Test of Proficiency in Korean (TOPIK) scale. The proposed hybrid system demonstrates superior performance compared to baseline models that rely on either linguistic or semantic features alone. The integration of the LLM-based semantic feature provides a significant improvement in scoring accuracy, more closely aligning the automated assessment with human expert judgments. By systematically combining measures of linguistic form and semantic content, this hybrid approach provides a more holistic and accurate assessment of Korean L2 writing proficiency. The system represents a practical and effective tool for supporting large-scale language education and assessment, aligning with the need for advanced AI-driven educational technology systems. Full article
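The semantic relevance feature described in this abstract reduces to a cosine similarity between two embedding vectors, one for the student essay and one for the LLM-generated reference answer. A minimal sketch of that computation follows; the embeddings themselves would come from an LLM or sentence encoder, which is not shown, and the function name is illustrative:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

An essay whose embedding scores close to 1.0 against the reference would be treated as semantically on-topic, and that score is then combined with the linguistic-form features as one input to the scoring model.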
