Search Results (825)

Search Parameters:
Keywords = language model benchmarking

19 pages, 1217 KB  
Article
Talking with Actionbits—A Part-Enhanced VLM for Action and Interaction Recognition in Animals
by Yang Yang, Ren Nakagawa, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Takenao Ohkawa and Fumio Okura
Sensors 2026, 26(6), 1969; https://doi.org/10.3390/s26061969 - 21 Mar 2026
Abstract
Understanding animal actions and interactions is essential for behavior analysis and ecological monitoring. Although large-scale in-the-wild datasets have advanced animal action recognition, existing methods still struggle with fine-grained motion, spatial relations, and multi-individual interactions. To address these challenges, we introduce AIRA, a unified framework for Action and Interaction Recognition in Animals. Built upon a vision–language model (VLM), AIRA learns in an action-centered representation space defined by body parts and their corresponding motions, thereby improving robustness to background noise and enabling cross-species generalization via a unified mammal-centric part ontology. To model actions, we treat body parts and motion as primary cues and introduce Actionbit tokens—compact representations for parts and motions generated by a large language model (LLM) that encode which parts move and how. We further propose Part-Enhanced Prompt Fine-tuning (PEPF) to make the VLM explicitly sensitive to part and pose cues. Within PEPF, the Action–actionbit Alignment (AbA) module enriches action representations with fine-grained part–motion semantics, and Part-Vision Prompting (PVP) extracts keyframes through action-aware prompting. Experiments across multiple benchmarks show consistent improvements in both action and interaction recognition, highlighting the importance of action-centered adaptation and relational reasoning for understanding animal behavior in the wild.
(This article belongs to the Special Issue Innovative Sensing Methods for Motion and Behavior Analysis)

50 pages, 1686 KB  
Review
Data Foundations for Medical AI: Provenance, Reliability and Limitations of Russian Clinical NLP Resources
by Arsenii Litvinov, Lev Malishevskii, Evgeny Karpulevich, Iaroslav Bespalov, Yaroslav Nedumov, Sergey Zhdanov, Ivan Oseledets, Evgeniy Shlyakhto and Arutyun Avetisyan
Informatics 2026, 13(3), 45; https://doi.org/10.3390/informatics13030045 - 20 Mar 2026
Abstract
Russian-language resources for medical natural language processing (NLP) are expanding rapidly; however, their fragmentation, uneven curation, and limited clinical reliability hinder the development of safe machine learning systems for prognosis, prevention, and precision medicine. We provide the first systematic survey of Russian medical NLP datasets and analyze their suitability for clinically meaningful tasks as defined by the MedHELM taxonomy. We additionally perform expert clinical validation of three representative public corpora—RuMedPrimeData (real outpatient notes), MedSyn (synthetic clinical notes), and RuMedNLI (translated natural language inference)—assessing clinical plausibility, diagnosis accuracy, and logical consistency. Experts identified substantial reliability issues: across randomly sampled subsets of each corpus, only approximately 20% of RuMedPrimeData records, fewer than 15% of MedSyn records, and approximately 55% of RuMedNLI pairs met essential quality criteria, which can hinder downstream ML systems built on these data. To support robust applications—ranging from medical chatbots and triage assistants to predictive and preventive models—we outline practical requirements for high-quality datasets: coordinated, expert-validated, machine-readable corpora aligned with clinical guidelines and insurance logic, standardized de-identification, and transparent provenance. Strengthening these data foundations will enable the development of reliable, reproducible, and clinically relevant AI systems suitable for real-world healthcare applications.
(This article belongs to the Special Issue From Data to Evidence: Transformative AI for Real-World Data)

19 pages, 599 KB  
Article
Reducing Hallucinations in Medical AI Through Citation Enforced Prompting in RAG Systems
by Lukasz Pawlik and Stanislaw Deniziak
Appl. Sci. 2026, 16(6), 3013; https://doi.org/10.3390/app16063013 - 20 Mar 2026
Abstract
The safe integration of Large Language Models in clinical environments requires strict adherence to verified medical evidence. As part of the PARROT AI project, this study provides a systematic evaluation of how prompting strategies affect the reliability of Retrieval-Augmented Generation (RAG) pipelines using the MedQA USMLE benchmark (N = 500). Four prompting strategies were examined: Baseline (zero-shot), Neutral, Expert Chain-of-Thought (Expert-CoT) with structured clinical reasoning, and StrictCitations with mandatory evidence grounding. The experiments covered six modern model architectures: Command R (35B), Gemma 2 (9B and 27B), Llama 3.1 (8B), Mistral Nemo (12B), and Qwen 2.5 (14B). Evaluation was conducted using the Deterministic RAG Evaluator, providing an objective assessment of grounding through the Unsupported Sentence Ratio (USR) based on TF-IDF and cosine similarity. The results indicate that structured reasoning in the Expert-CoT strategy significantly increases USR values (reaching 95–100%), as models prioritize internal diagnostic logic over verbatim context. In contrast, the StrictCitations strategy, while maintaining high USR due to the conservative evaluation threshold, achieves the highest level of verifiable grounding and source adherence. The analysis identifies a statistically significant Verbosity Signal (r = 0.81, p < 0.001), where increased response length serves as a proxy for model uncertainty and parametric leakage, a pattern particularly prominent in Llama 3.1 and Gemma 2. Overall, the findings demonstrate that prompting strategy selection is as critical for clinical reliability as model architecture. This work delivers a reproducible framework for the development of trustworthy medical AI assistants and highlights citation-enforced prompting as a vital mechanism for improving clinical safety.
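The Unsupported Sentence Ratio can be illustrated with a small sketch. This is not the paper's Deterministic RAG Evaluator: it uses raw term counts rather than full TF-IDF weighting, and the 0.3 threshold, tokenizer, and helper names are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    # Lowercased word counts; a stand-in for a real TF-IDF vectorizer.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def unsupported_sentence_ratio(answer: str, context: str, threshold: float = 0.3) -> float:
    # A sentence counts as "unsupported" when its similarity to the
    # retrieved context falls below the threshold.
    ctx = tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    unsupported = sum(1 for s in sentences if cosine(tokens(s), ctx) < threshold)
    return unsupported / len(sentences) if sentences else 0.0
```

A fully grounded answer scores 0.0; each sentence with no lexical overlap with the context pushes the ratio toward 1.0.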

24 pages, 2520 KB  
Article
MAFQA: A Dataset for Benchmarking Multi-Hop Arabic Fatwa Question Answering
by Manal Ali Al-Qahtani, Bader Fahad Alkhamees and Mourad Ykhlef
Data 2026, 11(3), 64; https://doi.org/10.3390/data11030064 - 20 Mar 2026
Abstract
Developing reliable Arabic question answering (QA) systems for Islamic fatwas requires datasets that capture the linguistic complexity and multi-step reasoning inherent in jurisprudential inquiries. However, the existing Arabic religious QA datasets primarily focus on direct retrieval or classification, often failing to address the multi-hop reasoning necessary for complex fatwa questions. To bridge this gap, we introduce MAFQA, a benchmark dataset specifically designed for multi-hop Arabic fatwa question answering. MAFQA was constructed from an extensive corpus of authentic fatwa records sourced from authoritative Islamic institutions. The dataset was developed via a semi-automated pipeline that integrates expert-guided identification of complex inquiries with a structured decomposition framework. This framework employs automated reasoning-pattern classification, semantic feature extraction, and template-guided annotation of subquestions and subanswers, followed by rigorous validation to ensure contextual grounding, logical coherence, and structural consistency. To evaluate the utility of the dataset, we conduct an extensive benchmarking study using Arabic-specialized, multilingual, and instruction-tuned language models across two primary tasks: question decomposition (QD) and generative question answering (QA). Performance is assessed using a comprehensive suite of lexical, semantic, relevance, and faithfulness metrics. Experimental results demonstrate that Arabic-specialized models consistently outperform their multilingual counterparts, with AraT5-base and AraBART achieving the highest performance in terms of lexical similarity, semantic alignment, and answer faithfulness.
(This article belongs to the Section Information Systems and Data Management)

45 pages, 33530 KB  
Article
AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments
by Georgios Simantiris, Konstantinos Bacharidis, Apostolos Papanikolaou, Petros Giannakakis and Costas Panagiotakis
Remote Sens. 2026, 18(6), 938; https://doi.org/10.3390/rs18060938 - 19 Mar 2026
Abstract
Accurate flood detection is critical for disaster response, yet the scarcity of diverse annotated datasets hinders robust model development. Existing resources typically suffer from limited geographic scope and insufficient annotation granularity, restricting the generalization capabilities of computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive evaluation benchmark designed to advance domain-generalized Artificial Intelligence for climate resilience. The dataset comprises 470 high-resolution aerial images capturing 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures exceptional global diversity and temporal relevance (2022–2024), supporting three complementary tasks: (i) Image Classification, featuring novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation, providing precise pixel-level masks for flood, sky, buildings, and background; and (iii) Visual Question Answering (VQA), enabling natural language reasoning for disaster assessment. We provide baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset’s complexity and its utility in fostering robust AI tools for environmental monitoring. Crucially, we show that despite its compact size, AIFloodSense enables better generalization on external test sets than much larger alternatives, validating the premise that rigorous diversity is more effective than scale for training robust flood detection models. The dataset is made publicly available to accelerate further research in the field.

24 pages, 2081 KB  
Article
Research on Large Language Model-Based Bibliographic Cataloging Agent in the CNMARC Context
by Zhuoxi Tan, Xin Yang, Qinyu Chen and Tao Chen
Publications 2026, 14(1), 19; https://doi.org/10.3390/publications14010019 - 18 Mar 2026
Abstract
To address the efficiency and cost limitations of traditional manual cataloging, this study proposes a large language model-driven automated cataloging workflow in which the Metadata Extraction Agent (MEA), Description Cataloging Agent (DCA), Subject Analysis & Indexing Agent (SAIA), and Quality Control Agent (QCA) collaborate to perform cataloging tasks. Experiments are conducted using a dataset of over 33,000 CNMARC bibliographic records from a university library, together with data from the Chinese Library Classification (5th edition). Meanwhile, the agent-based workflow framework directly employs large language models without additional enhancement techniques, thereby providing a useful experimental benchmark for evaluating future AI-assisted cataloging systems. The results show that the framework performs well in metadata recognition, bibliographic description, and macro-level classification tasks, and can relatively stably generate standardized records. However, limitations remain in fine-grained semantic indexing and the interpretation of complex contexts. Therefore, in light of the capability limitations revealed by the experimental results, the study argues that fully automated end-to-end cataloging relying solely on generative AI is not yet entirely feasible. Future improvements should integrate techniques such as retrieval-augmented generation, supervised fine-tuning, and structured reasoning prompts, while establishing traceable mechanisms to enhance the reliability of intelligent cataloging.
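The MEA → DCA → SAIA → QCA workflow can be sketched as a simple pipeline of functions over a shared record. Everything below is hypothetical scaffolding: in the actual system each agent wraps a large-language-model call, and the field names and the sample CLC class number are placeholders.

```python
# Minimal sketch of a four-agent cataloging pipeline; each "agent" is a
# plain function over a shared record dict (LLM calls replaced by stubs).

def metadata_extraction_agent(record):          # MEA: pull raw metadata
    record["metadata"] = {"title": record["raw"].split(".")[0]}
    return record

def description_cataloging_agent(record):       # DCA: build a CNMARC-style field
    record["cnmarc_200"] = f"$a{record['metadata']['title']}"
    return record

def subject_analysis_agent(record):             # SAIA: assign a class number
    record["classification"] = "TP391"          # hypothetical CLC class
    return record

def quality_control_agent(record):              # QCA: check required fields exist
    record["valid"] = bool(record.get("cnmarc_200")) and bool(record.get("classification"))
    return record

def catalog(raw_text: str) -> dict:
    record = {"raw": raw_text}
    for agent in (metadata_extraction_agent, description_cataloging_agent,
                  subject_analysis_agent, quality_control_agent):
        record = agent(record)
    return record
```

The point of the shape, not the stubs: each agent reads and enriches the same record, so a downstream quality gate can validate everything the earlier agents produced.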
(This article belongs to the Special Issue Overview on Today’s AI Tools for Authors)

23 pages, 458 KB  
Article
Automated Generation and Evaluation of Interactive-Fiction Serious Games with Open-Weight LLMs
by Finn Rogosch and Andreas Schrader
Appl. Sci. 2026, 16(6), 2932; https://doi.org/10.3390/app16062932 - 18 Mar 2026
Abstract
This work investigates whether open-weight large language models can automatically generate runnable and educationally faithful serious games in a constrained, text-only interactive-fiction (IF) setting. The target games are station-based single-player serious games for knowledge assessment, implemented as IF in a structured, machine-readable text format, and used here as a first step towards later ambient scenarios. A fully automated pipeline called SINE (Serious Interactive Narrative Engine) is evaluated with four prompting strategies, grammar-guided decoding, deterministic validation, and a repair agent. Across a staged evaluation with 240 seeds and increasing complexity, finalist configurations reach success rates between roughly 68% and 86% on the joint criterion of compilation, playability, and learning-goal fidelity. Repair iterations proved central to robustness, whereas grammar masking on top of reasoning prompts did not consistently improve outcomes. The study provides a reproducible benchmark setup, open artifacts, and a constrained generation pipeline as a basis for later extensions toward broader serious game scenarios.
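The generate → validate → repair loop central to SINE's robustness can be sketched as follows. The generator, the validator rule, and the repair step below are hypothetical stand-ins, not SINE's actual grammar or agents; only the control flow mirrors the pipeline described above.

```python
# Sketch of a bounded generate -> validate -> repair loop.

def generate(seed: int) -> str:
    # Stand-in for the LLM generation step; odd seeds produce a defect.
    return "STATION 1\nQUESTION missing" if seed % 2 else "STATION 1\nQUESTION What is RAM?"

def validate(game: str) -> list[str]:
    # Deterministic checks returning a list of error messages.
    errors = []
    if "QUESTION missing" in game:
        errors.append("station has no question text")
    return errors

def repair(game: str, errors: list[str]) -> str:
    # Stand-in for the repair agent: patches the one known defect.
    return game.replace("QUESTION missing", "QUESTION What is RAM?")

def generate_with_repair(seed: int, max_repairs: int = 3) -> tuple[str, bool]:
    # Returns the final artifact and whether it passed validation.
    game = generate(seed)
    for _ in range(max_repairs):
        errors = validate(game)
        if not errors:
            return game, True
        game = repair(game, errors)
    return game, not validate(game)
```

Bounding the repair count keeps the pipeline deterministic in cost while still letting most defective generations be salvaged, which matches the paper's observation that repair iterations drive the success rate.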
(This article belongs to the Special Issue Artificial Intelligence in Education: Latest Advances and Prospects)

31 pages, 2343 KB  
Article
Construction and Application of Heterogeneous Graph Neural Network Model Fusing Meta-Path Sequence Features
by Xingqiu Zhang and Sang-Chul Kim
Electronics 2026, 15(6), 1261; https://doi.org/10.3390/electronics15061261 - 18 Mar 2026
Abstract
In real-world applications, the prevalence of heterogeneous graph data has driven the development of heterogeneous graph neural networks (HGNNs) as an effective solution for modeling intricate semantic relationships. A widely adopted strategy involves using meta-paths as high-level structural motifs to direct neighborhood aggregation in HGNNs. Nevertheless, the semantic content inherent in meta-paths themselves is often not fully exploited, even though they are typically employed as guiding signals. This paper introduces a new HGNN architecture that utilizes meta-path sequences, integrating the intrinsic information of meta-paths directly into the semantic fusion mechanism. By representing meta-paths as sequential data—similar to sequences in natural language processing—we are able to capture more detailed semantic patterns through the sequential order of node types in heterogeneous graphs. Using sequence modeling methods, our approach embeds meta-path semantics into the graph neural network, offering not only additional structural insights but also enabling the training of specialized embeddings for node types. We perform extensive experiments, comprising comparative and ablation analyses, on a custom-built dataset and three publicly available medium-scale heterogeneous graph benchmarks. The experimental outcomes validate the efficacy of our method in utilizing sequential characteristics of meta-paths to improve representation learning.

26 pages, 977 KB  
Article
KE-MLLM: A Knowledge-Enhanced Multi-Sensor Learning Framework for Explainable Fake Review Detection
by Jiaying Chen, Jingyi Liu, Yiwen Liang and Mengjie Zhou
Appl. Sci. 2026, 16(6), 2909; https://doi.org/10.3390/app16062909 - 18 Mar 2026
Abstract
The proliferation of fake reviews on e-commerce and social platforms has severely undermined consumer trust and market integrity, necessitating robust and interpretable real-time detection mechanisms with multi-sensor data fusion capabilities. While traditional machine learning approaches have shown promise in identifying fraudulent reviews, they often lack transparency and fail to leverage the rich contextual knowledge embedded in large-scale datasets. In this paper, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), a unified framework that integrates knowledge-enhanced prompting with parameter-efficient fine-tuning for explainable fake review detection. Our approach employs LoRA (Low-Rank Adaptation) to fine-tune lightweight large language models (LLaMA-3-8B) on review text, while incorporating multimodal behavioral sensor signals including temporal patterns, user metadata, and social network characteristics for comprehensive anomaly sensing. To address the critical need for interpretability in fraud detection systems, we implement a Chain-of-Thought (CoT) reasoning module that generates human-understandable explanations for classification decisions, highlighting linguistic anomalies, sentiment inconsistencies, and behavioral red flags. We enhance the model’s discriminative capability through a knowledge distillation strategy that transfers domain-specific expertise from larger teacher models while maintaining computational efficiency suitable for edge sensing devices. Extensive experiments on two benchmark datasets—YelpChi and Amazon Reviews from the DGL Fraud Dataset—show that KE-MLLM achieves strong performance, reaching an F1-score of 94.3% and an AUC-ROC of 96.7% on YelpChi and outperforming the strongest baseline in our comparison by 5.8 and 4.2 percentage points, respectively. Furthermore, human evaluation indicates that the generated explanations achieve 89.5% consistency with expert annotations, suggesting that the framework can improve the interpretability and practical usefulness of automated fraud detection systems. The proposed framework provides a useful step toward more accurate and interpretable fake review detection and offers a practical reference for building more transparent and accountable AI systems in high-stakes applications.
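LoRA, as used above, freezes the base weight matrix W and trains only a low-rank update scaled by alpha / r. A minimal plain-Python sketch of the forward pass; the shapes, the alpha value, and the A @ B orientation are illustrative choices, not taken from the paper.

```python
# LoRA forward pass sketch: y = x @ (W + (alpha / r) * A @ B),
# where W (d_in x d_out) is frozen and only the low-rank factors
# A (d_in x r) and B (r x d_out) would be trained.

def matmul(X, Y):
    # Plain-Python matrix product for small illustrative matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha=16, r=2):
    scale = alpha / r
    delta = matmul(A, B)                                   # d_in x d_out low-rank update
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(x, W_eff)
```

Because only A and B carry gradients, the trainable parameter count drops from d_in * d_out to r * (d_in + d_out), which is why an 8B-parameter backbone can be adapted cheaply.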

33 pages, 2332 KB  
Article
EvalHack: Answer-Side Prompt Injection for Probing LLM Exam-Grading Panel Stability
by Catalin Anghel, Marian Viorel Craciun, Adina Cocu, Andreea Alexandra Anghel, Antonio Stefan Balau, Adrian Istrate and Aurelian-Dumitrache Anghele
Information 2026, 17(3), 297; https://doi.org/10.3390/info17030297 - 18 Mar 2026
Abstract
Large language models are increasingly used as automated graders, yet their reliability under answer-side manipulation and their behavior in multi-model panels remain insufficiently understood. This paper introduces EvalHack, a matrix benchmark in which a fixed committee of four LLMs grades university-level machine learning exam answers under a strict integer-only contract (0–10) grounded in instructor-authored rubric artifacts. The dataset comprises 100 students answering 10 short, open-ended items (1000 answers). For each answer, the evaluation includes a clean version and two content-preserving adversarial variants that operate only on the student text: A1, a visible coercive suffix appended to the answer, and A2, a stealth variant that uses Unicode control characters (e.g., zero-width and bidirectional marks) to embed an instruction. EvalHack instruments the full grading pipeline, recording item-level member scores, the committee aggregate, within-panel disagreement, and discrepancies to human grades. Empirically, answer-side edits induce systematic score inflation and stronger top-end concentration, with edited answers clustering near the upper end of the scale. Within-panel disagreement, measured as the range between the highest and lowest member score, varies across conditions, with median Consistency Spread values of 3.0 (clean), 2.0 (A1), and 6.0 (A2). Compared to human graders, the panel is more lenient on average (MAE = 1.897; human − panel bias = −1.345). Finally, grouping items by disagreement shows that low-disagreement items exhibit smaller human-panel errors, indicating that within-panel spread can serve as a practical uncertainty signal for routing difficult answers to human review or to larger/more specialized panels.
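The panel statistics above (Consistency Spread, MAE, and the human − panel bias) are simple to compute. A sketch with made-up scores; the aggregation by plain mean is an assumption, as the paper may combine member scores differently.

```python
from statistics import mean, median

def consistency_spread(panel_scores):
    # Spread = highest minus lowest member score for one answer.
    return max(panel_scores) - min(panel_scores)

def panel_report(items):
    # items: list of (human_grade, [member scores]) pairs, one per answer.
    spreads = [consistency_spread(scores) for _, scores in items]
    aggregates = [mean(scores) for _, scores in items]   # committee aggregate (assumed mean)
    mae = mean(abs(h - a) for (h, _), a in zip(items, aggregates))
    bias = mean(h - a for (h, _), a in zip(items, aggregates))  # human - panel
    return {"median_spread": median(spreads), "mae": mae, "bias": bias}
```

A negative bias, as reported in the abstract, means the panel's aggregate sits above the human grade on average, i.e. the committee is the more lenient grader.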
(This article belongs to the Section Artificial Intelligence)

31 pages, 1934 KB  
Review
Artificial Intelligence for Detecting Electoral Disinformation on Social Media: Models, Datasets, and Evaluation
by Félix Díaz, Nhell Cerna, Rafael Liza and Bryan Motta
Information 2026, 17(3), 292; https://doi.org/10.3390/info17030292 - 17 Mar 2026
Abstract
During elections, information manipulation on social media has accelerated the use of artificial intelligence, yet the evidence is difficult to interpret without an integrated view of methods, data, and evaluation. We mapped 557 English-language journal articles from Scopus and Web of Science, combining performance indicators, science mapping, and a focused full-text synthesis of highly cited papers. The literature grows sharply after 2019, peaks in 2025, and shows geographically uneven production, with collaboration structured around a small set of hubs. The thematic structure suggests that, during the pandemic era, infodemic-related research served as a catalyst, intensifying scientific attention to fake news and disinformation and expanding the associated detection and monitoring agendas. In addition, socio-political harm constructs such as hate speech, extremism, and polarization appear as recurrent and structurally central targets, highlighting that election-relevant work often extends beyond veracity assessment toward monitoring discourse risks. Blockchain also emerges as a novel and adjacent integrity theme, aligned with authenticity and provenance-oriented mitigation rather than mainstream detection pipelines. AI for electoral disinformation is not reducible to veracity classification, as influential studies also target automation and coordinated behavior, verification support, diffusion analysis, and estimation frameworks that focus on exposure and impact. Evaluation remains heterogeneous and is often shaped by benchmark settings, making high accuracy values hard to compare and potentially misleading when labeling quality, topic leakage, or context shift are not characterized. Overall, the findings motivate evaluation protocols that align operational objectives with modeling roles and explicitly address robustness to temporal and platform changes, asymmetric error costs during election windows, and representativeness across electoral contexts and languages, while also guiding future work on emerging integrity challenges and governance-relevant deployment settings.
(This article belongs to the Section Artificial Intelligence)

12 pages, 195 KB  
Data Descriptor
A Multilingual Dataset of Student Answers, Human Grading, and Multi-LLM Evaluations for Automated Assessment Research Using JorGPT
by Jorge Cisneros-González, Natalia Gordo-Herrera, Iván Barcia-Santos, Yolanda Cerezo and Javier Sánchez-Soriano
Data 2026, 11(3), 59; https://doi.org/10.3390/data11030059 - 17 Mar 2026
Abstract
The increasing adoption of Large Language Models (LLMs) in higher education has created a need for high-quality, publicly available benchmarks for automated assessment. Existing datasets often rely on synthetic responses or lack detailed human feedback. This paper presents a multilingual dataset of 3041 authentic student answers to 50 open-ended Computer Science questions, collected from real university assessments during the 2025–2026 academic year. The dataset includes the original student responses (Spanish) and their parallel translations (English), instructor-defined ideal answers, blind human grading with qualitative feedback, and structured evaluations from three state-of-the-art LLMs (DeepSeek-chat-V3.2, Qwen-flash-2025-07-28, Gemini-2.5-flash-lite-001) using a unified JSON schema. This resource enables reproducible research in automated grading, feedback generation, and cross-lingual educational NLP.

32 pages, 2055 KB  
Article
Leveraging Transformers and LLMs for Automated Grading and Feedback Generation Using a Novel Dataset
by Asmaa G. Khalf, Emad Nabil, Wael H. Gomaa, Oussama Benrhouma and Amira M. El-Mandouh
Data 2026, 11(3), 57; https://doi.org/10.3390/data11030057 - 16 Mar 2026
Abstract
Automated Short Answer Grading (ASAG) has garnered significant attention in the field of educational technology due to its potential to improve the efficiency, scalability, and consistency of student assessments. This study introduces a novel dataset of 651 student responses from a Database Transaction course exam at Beni-Suef University, referred to as the Beni-Suef Transaction Processing (BeSTraP) dataset. BeSTraP is specifically designed to support ASAG evaluation. To assess ASAG performance, five approaches were employed: string-based similarity, semantic similarity, a hybrid of both, fine-tuning transformer-based models, and the application of Large Language Models (LLMs). The experimental results indicated that fine-tuned transformers, particularly GPT-2, achieved the highest Pearson correlation with human scores (0.8813) on the new dataset and maintained robust performance on the Mohler benchmark (0.7834). In addition to grading, the framework integrates automated feedback generation through LLMs, further enriching the assessment process. This research contributes (i) a novel, domain-specific dataset derived from an actual university examination, (ii) a comprehensive comparison of traditional and transformer-based approaches, and (iii) evidence of the efficacy of fine-tuned models in providing accurate and scalable grading solutions. The created dataset will be publicly available for the community.
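The Pearson correlation used to compare model grades with human scores is standard; a self-contained sketch (the example score lists in the usage note are invented, not BeSTraP data).

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation: covariance of the two score lists divided by
    # the product of their standard deviations (up to the shared 1/n factor,
    # which cancels).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For ASAG, `xs` would be model-assigned grades and `ys` the human grades; a value near 1 (such as the paper's 0.8813) means the model ranks and spaces answers much like the human grader does.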

23 pages, 3955 KB  
Article
Towards Self-Evolving Agents: A Dual-Process Framework for Continual Context Refinement
by Liangyu Teng, Wei Ni, Liang Song, Jun Shi and Yanfei Li
Electronics 2026, 15(6), 1232; https://doi.org/10.3390/electronics15061232 - 16 Mar 2026
Abstract
Large Language Models (LLMs) have become essential for interactive AI systems, yet they remain fundamentally static after deployment: they cannot update their parameters from interaction feedback and often repeat the same mistakes across long interaction streams. We propose Dual-Process Agent (DPA), a framework for continual context refinement that enables learning without modifying a frozen model backbone. Inspired by dual-process theory from cognitive science, DPA decomposes each interaction episode into two complementary processes: a fast System 1 that retrieves compact, relevant context from an explicit long-term memory and generates responses, and a slow System 2 that reflects on outcomes and writes curated updates back into memory. To prevent memory degradation over extended interactions, DPA maintains bulletized memory entries with utility statistics and employs a conservative curator gate that filters generic, redundant, or conflicting insertions. Experiments on six diverse benchmarks demonstrate that DPA consistently outperforms vanilla prompting and competitive baselines on both GPT-5.1 and Llama-3.1-8B backbones, achieving the best overall performance across multiple reasoning and knowledge-intensive tasks.
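The memory-with-curator idea can be sketched as a small class: System 1 retrieves and bumps utility counters, System 2 writes back through a gate that rejects near-duplicates. The word-overlap retrieval, the 0.8 redundancy threshold, and all names are hypothetical; the paper's actual gate rules may differ.

```python
# Sketch of a bulletized long-term memory with a conservative curator gate.

class Memory:
    def __init__(self):
        self.entries = {}                      # bullet text -> utility statistics

    def retrieve(self, query: str, k: int = 3):
        # System 1: fetch the k bullets sharing the most words with the query,
        # updating each retrieved bullet's usage counter.
        qwords = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(qwords & set(e.lower().split())),
                        reverse=True)
        for e in scored[:k]:
            self.entries[e]["uses"] += 1
        return scored[:k]

    def curate(self, insight: str) -> bool:
        # System 2: reject insertions that heavily overlap an existing bullet.
        words = set(insight.lower().split())
        for existing in self.entries:
            overlap = len(words & set(existing.lower().split()))
            if overlap / max(len(words), 1) > 0.8:
                return False                   # redundant; filtered by the gate
        self.entries[insight] = {"uses": 0}
        return True
```

The utility counters give the curator a signal for later pruning: bullets that are stored but never retrieved are candidates for removal, which is one way to keep the memory from degrading over long interaction streams.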
(This article belongs to the Section Artificial Intelligence)

16 pages, 6152 KB  
Article
DisasterReliefGPT: Multimodal AI for Autonomous Disaster Impact Assessment and Crisis Communication
by Lekshmi Chandrika Reghunath, Athikkal Sudhir Abhishek, Arjun Changat, Arjun Unnikrishnan, Ayush Kumar Rai, Christian Napoli and Cristian Randieri
Technologies 2026, 14(3), 179; https://doi.org/10.3390/technologies14030179 - 16 Mar 2026
Abstract
The work presented herein proposes DisasterReliefGPT, a multimodal AI system for automation in the areas of crisis communication and post-disaster assessment. The system integrates three tightly coupled components: a vision module called DisasterOCS for structural damage detection in satellite images, a Large Vision–Language Model (LVLM) for enhanced visual understanding and contextual reasoning, and a Large Language Model (LLM) to produce detailed, clear assessment reports. DisasterOCS relies on a ResNet34-based encoder with partial weight sharing and event-specific decoders, coupled with a custom MultiCrossEntropyDiceLoss function for multi-class segmentation on pre- and post-disaster image pairs. On the benchmark xBD dataset, the developed system reaches an F1 score of 78.8% for damage classification, identifies destroyed buildings with 81.3% precision, and detects undamaged structures with a score of 90.7%. From a combination of these components, emergency responders can immediately provide reliable and readable assessments of damage that can be used to directly support urgent decision-making.
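A combined cross-entropy + Dice objective, in the spirit of the custom MultiCrossEntropyDiceLoss named above, can be sketched over one class's flattened mask. This is a generic illustration, not the paper's loss: the per-class weighting, the 0.5 mixing weight, and the smoothing constant are assumptions.

```python
from math import log

def dice_loss(probs, target):
    # Soft Dice loss over one class's flat mask:
    # 1 - 2*|P.T| / (|P| + |T|), with a small smoothing term.
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + 1e-6) / (sum(probs) + sum(target) + 1e-6)

def ce_loss(probs, target):
    # Per-pixel binary cross-entropy, averaged over the mask.
    eps = 1e-7
    return -sum(t * log(p + eps) + (1 - t) * log(1 - p + eps)
                for p, t in zip(probs, target)) / len(probs)

def combined_loss(probs, target, w_dice=0.5):
    # Weighted sum of the two terms; Dice counters class imbalance
    # (rare damage pixels) while cross-entropy keeps gradients well-behaved.
    return w_dice * dice_loss(probs, target) + (1 - w_dice) * ce_loss(probs, target)
```

Dice directly optimizes region overlap, so it stays informative even when damaged pixels are a tiny fraction of the image, which is the usual motivation for pairing it with cross-entropy in damage segmentation.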
