Search Results (460)

Search Parameters:
Keywords = pretrained large language models

24 pages, 453 KB  
Article
Reason2Decide-C: Adaptive Cycle-Consistent Training for Clinical Rationales
by H M Quamran Hasan, Housam Khalifa Bashier Babiker, Mi-Young Kim and Randy Goebel
Computers 2026, 15(5), 279; https://doi.org/10.3390/computers15050279 - 27 Apr 2026
Abstract
Large Language Models (LLMs) used for clinical decision support must not only make accurate predictions but also generate rationales that are consistent with, and sufficient for, those predictions. Building on Reason2Decide, a two-stage rationale-driven multi-task framework, we propose Reason2Decide-C (R2D-C, where C denotes cycle consistency), which augments Reason2Decide’s stage 2 training with confidence-adaptive scheduled sampling and cycle-consistent rationale-to-label training. In stage 1, we pretrain our model on rationale generation. In stage 2, we jointly train on label prediction and rationale generation, gradually replacing gold labels with model-predicted labels based on confidence. Simultaneously, we feed the rationale logits back into the model to recover the label, thus enforcing explanation sufficiency. We evaluate R2D-C on one proprietary triage dataset, as well as public biomedical QA and reasoning datasets. Across model sizes, R2D-C substantially improves rationale–prediction consistency (where stage 1 and stage 2 predictions agree) and sufficiency (where the rationale alone recovers the ground-truth label) over other baselines while matching or modestly improving predictive performance (F1); in several settings R2D-C surpasses 40× larger foundation models. Ablations confirm that the full combination is optimal, maximizing alignment and LLM-as-a-Judge rationale quality. These results demonstrate that confidence-adaptive scheduled sampling and cycle-consistent rationale-to-label training substantially enhance explanation alignment without sacrificing accuracy. Full article
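The two stage-2 ingredients the abstract describes can be sketched in a few lines. This is our illustration, not the authors' code: the threshold and loss weights are placeholder values, and the function names are ours.

```python
def stage2_label(gold, predicted, confidence, threshold=0.9):
    """Confidence-adaptive scheduled sampling (sketch): once the model
    is confident enough in its own prediction, that prediction replaces
    the gold label in the stage-2 training input; `threshold` is an
    illustrative value, not a reported hyperparameter."""
    return predicted if confidence >= threshold else gold

def r2dc_objective(label_loss, rationale_loss, cycle_loss,
                   alpha=1.0, beta=1.0, gamma=0.5):
    """Hypothetical weighted combination of the three stage-2 terms:
    label prediction, rationale generation, and the cycle term that
    recovers the label from the rationale logits. Weights are
    placeholders."""
    return alpha * label_loss + beta * rationale_loss + gamma * cycle_loss
```

Early in training, low confidence keeps the gold label in the loop; as confidence grows, the model increasingly conditions on its own predictions, which is what makes the rationale–prediction cycle consistent at inference time.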

26 pages, 1714 KB  
Article
SV-GEN: Synergizing LLM-Empowered Variable Semantics and Graph Transformers for Vulnerability Detection
by Zhaohui Liu, Haocheng Yang and Wenjie Xie
Future Internet 2026, 18(5), 236; https://doi.org/10.3390/fi18050236 - 27 Apr 2026
Abstract
Deep-learning-based vulnerability detection has made substantial progress, but two limitations remain prominent. Sequence-based methods linearize source code and thus weaken the explicit modeling of control-flow and data-flow dependencies. Graph-based methods preserve program structure, yet conventional graph neural networks still have difficulty capturing long-range interactions in large code property graphs (CPGs). In addition, standard CPGs usually lack explicit variable semantics and security-critical node roles, which limits their ability to represent vulnerability-relevant program behavior. To address these issues, we propose SV-GEN, a vulnerability detection framework that combines large-language-model-driven semantic enhancement with hybrid sequence-graph learning. The novelty of SV-GEN lies in introducing a semantically enriched code property graph, termed Sem-CPG, which augments conventional CPGs with variable semantic roles and security-oriented node labels, and in coupling this representation with an adaptive fusion mechanism over structural and sequential views. Specifically, we use a large language model as an external semantic annotator to assign variable roles and identify source, sink, and sanitizer nodes, and then encode the resulting Sem-CPG with a Graph Transformer while modeling the code sequence with GraphCodeBERT. A learnable gating module is further used to adaptively fuse the graph-level and sequence-level representations for final prediction. Experiments on Devign, ReVeal, and DiverseVul show that SV-GEN achieves competitive or superior overall performance across benchmarks, with particularly strong improvements on the large and highly imbalanced DiverseVul dataset. Full article
(This article belongs to the Special Issue Security of Computer System and Network)
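The adaptive fusion over structural and sequential views can be pictured as a scalar sigmoid gate. A minimal numpy sketch, assuming a learned gate weight vector `w` (the names and the scalar-gate form are our simplification of SV-GEN's gating module):

```python
import numpy as np

def gated_fusion(h_graph, h_seq, w, b=0.0):
    """Sketch of an adaptive gating module: a gate computed from both
    views decides how much of the Graph Transformer (Sem-CPG) embedding
    versus the GraphCodeBERT sequence embedding enters the prediction."""
    z = np.concatenate([h_graph, h_seq]) @ w + b   # gate pre-activation
    g = 1.0 / (1.0 + np.exp(-z))                   # sigmoid gate in (0, 1)
    return g * h_graph + (1.0 - g) * h_seq
```

With zero weights the gate sits at 0.5 and the two views are averaged; training moves the gate toward whichever view is more informative for a given function.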

19 pages, 870 KB  
Article
Integrating Unsupervised Learning for the Factual Consistency of Generative Models
by Sindhu Nair and Y. S. Rao
Future Internet 2026, 18(5), 235; https://doi.org/10.3390/fi18050235 - 27 Apr 2026
Abstract
Text summarization involves analyzing large amounts of text, selecting the salient text features, and arranging them coherently. The graph-based TextRank and statistical topic modeling are unsupervised approaches for generating an extractive synopsis. Deep learning models are supervised, data-driven, and pre-trained on a huge corpus of data, making a significant contribution to automatic text summarization systems. Despite grammatical correctness and coherence, deep learning-based summarization systems are prone to factual inconsistency. This has hindered the applicability of transformer-based summarizers, particularly in critical domains where misleading summarization systems can lead to severe consequences due to their significant social impact. This work proposes an ingenious hybrid hierarchical approach that combines unsupervised approaches, such as the TextRank algorithm and Latent Dirichlet Allocation (LDA)-based summaries, with contemporary transformer-based language models. When validated on three benchmark summarization datasets, empirical results prove that our hybrid hierarchical transformer-based approach mitigates the factual inconsistency problem inherent in abstractive summarization. The improved summary consistency score of the abstractive summaries generated with our multilevel hybrid approach, in comparison to the fine-tuned baseline transformer-based language models, increases trust in transformer-based summarizers. Full article
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))
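The unsupervised extractive stage rests on TextRank, i.e. PageRank over a sentence-similarity graph. A minimal sketch (not the paper's implementation) that takes a precomputed pairwise similarity matrix:

```python
import numpy as np

def textrank_scores(sim, d=0.85, iters=50):
    """Minimal TextRank: row-normalise the sentence-similarity matrix
    into transition probabilities, then run the PageRank iteration with
    damping factor d. Higher-scoring sentences go into the extractive
    synopsis."""
    n = sim.shape[0]
    p = sim / sim.sum(axis=1, keepdims=True)       # row-stochastic transitions
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1.0 - d) / n + d * (p.T @ scores)
    return scores
```

In the hybrid pipeline described above, the top-ranked sentences (together with LDA topic summaries) would then condition the transformer-based abstractive model, anchoring it to source content.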

35 pages, 13771 KB  
Article
BioLAMR: A Biomimetically Inspired Large Language Model Adaptation Framework for Automatic Modulation Recognition
by Yubo Mao, Wei Xu, Jijia Sang and Haoan Liu
Biomimetics 2026, 11(4), 288; https://doi.org/10.3390/biomimetics11040288 - 21 Apr 2026
Viewed by 331
Abstract
Automatic modulation recognition (AMR) is increasingly relevant to communication-sensing front ends in robotic and human–robot collaborative systems, where reliable spectrum awareness and adaptive wireless reception are desired. However, existing methods often degrade sharply at low signal-to-noise ratios (SNRs), and large language models (LLMs) are not natively compatible with continuous I/Q signals due to the inherent modality gap. We propose BioLAMR, a GPT-2 adaptation framework for AMR inspired by the auditory system’s parallel time–frequency processing and cortical hierarchy. The framework combines bio-inspired dual-domain feature extraction with parameter-efficient LLM adaptation. BioLAMR includes three components. First, a lightweight dual-domain fusion (LDDF) module extracts complementary time- and frequency-domain features and fuses them through channel and spatial attention. Second, a convolutional embedding module converts continuous I/Q signals into GPT-2-compatible sequences without discrete tokenization. Third, a hierarchical fine-tuning strategy updates only 8.9% of parameters to preserve pretrained knowledge while adapting to modulation recognition. Experiments on the RadioML2016.10a and RadioML2016.10b benchmarks show that BioLAMR achieves overall accuracies of 64.99% and 67.43%, outperforming the strongest competing method by 2.60 and 2.47 percentage points, respectively. Under low-SNR conditions, it reaches 36.78% and 38.14%, the best results among the compared methods. Ablation studies verify the contribution of each component. These results demonstrate that combining dual-domain signal modeling with parameter-efficient GPT-2 adaptation is an effective route to robust AMR in challenging wireless environments. Full article
(This article belongs to the Section Locomotion and Bioinspired Robotics)
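The key bridging step, turning continuous I/Q samples into GPT-2-compatible tokens without discrete tokenization, can be sketched as patch-and-project. This is our illustration of the idea, with a random projection standing in for BioLAMR's learned convolution weights:

```python
import numpy as np

def iq_to_tokens(iq, patch=16, d_model=8, seed=0):
    """Sketch of a convolutional-style embedding: slice a continuous
    2xL I/Q signal into non-overlapping patches and project each patch
    to a d_model-dimensional vector, yielding a token sequence a
    decoder-only LLM can consume. Patch size and width are illustrative."""
    rng = np.random.default_rng(seed)
    c, length = iq.shape                            # c == 2 (I and Q channels)
    n = length // patch
    patches = iq[:, :n * patch].reshape(c, n, patch)
    flat = patches.transpose(1, 0, 2).reshape(n, c * patch)
    w = rng.standard_normal((c * patch, d_model))   # stand-in for learned kernel
    return flat @ w                                 # (n_tokens, d_model)
```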

26 pages, 31446 KB  
Article
A Training-Free Paradigm for Data-Scarce Maritime Scene Classification Using Vision-Language Models
by Jiabao Wu, Yujie Chen, Wentao Chen, Yicheng Lai, Junjun Li, Xuhang Chen and Wangyu Wu
Sensors 2026, 26(8), 2549; https://doi.org/10.3390/s26082549 - 21 Apr 2026
Viewed by 235
Abstract
Maritime Domain Awareness (MDA) relies heavily on data acquired from high-resolution optical spaceborne sensors; however, processing this massive quantity of sensor data via traditional supervised deep learning is severely bottlenecked by its dependency on exhaustively annotated datasets. Under extreme data scarcity, conventional architectures suffer severe performance degradation, rendering them impractical for time-critical, zero-day deployments. To overcome this barrier, we propose a training-free inference paradigm that leverages the extensive pre-trained knowledge of Large Vision-Language Models (VLMs). Specifically, we introduce a Domain Knowledge-Enhanced In-Context Learning (DK-ICL) framework coupled with a Macro-Topological Chain-of-Thought (MT-CoT) strategy. This approach bridges the perspective gap between natural images and top–down optical sensor imagery by translating expert remote sensing heuristics into a strict, step-by-step reasoning pipeline. Extensive evaluations demonstrate the substantial efficacy of this framework. Armed with merely 4 visual exemplars per category as in-context triggers, our MT-CoT augmented VLMs outperform traditional models trained under identical scarcity by over 38% in F1-score. Crucially, real-world case studies confirm that this zero-gradient approach maintains robust generalization on unannotated, out-of-distribution coastal clutters, achieving performance parity with data-heavy networks trained on 50 times the data volume. By substituting massive human annotation and GPU optimization with scalable logical deduction, this paradigm establishes a resource-efficient foundation for next-generation intelligent maritime sensing networks. Full article
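A training-free pipeline of this kind is ultimately a prompt-construction exercise. The sketch below shows the shape of a DK-ICL prompt with a macro-topological chain of thought; the reasoning steps and field labels are our paraphrase, not the paper's actual prompt text:

```python
def build_mtcot_prompt(exemplars, query_desc):
    """Illustrative prompt builder: a few labelled in-context exemplars
    plus an explicit step-by-step instruction encoding remote-sensing
    heuristics (macro layout first, topology second, category last)."""
    steps = ("Reason step by step: (1) describe the macro layout of the "
             "scene from the top-down view; (2) check coastline, wake and "
             "berth topology; (3) only then assign the maritime category.")
    shots = "\n".join(f"Image: {img} -> Category: {lab}"
                      for img, lab in exemplars)
    return f"{steps}\n{shots}\nImage: {query_desc} -> Category:"
```

With four exemplars per category, as in the paper's setup, the whole "training" step reduces to assembling such a prompt per query.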

29 pages, 417 KB  
Article
An AI-Based Security Architecture for Fraud Detection in Cloud Call Centers for Low-Resource Languages: Arabic as a Use Case
by Pinar Boluk and Hana’a Maratouq
Electronics 2026, 15(8), 1718; https://doi.org/10.3390/electronics15081718 - 18 Apr 2026
Viewed by 161
Abstract
Cloud-based telephony platforms face growing fraud risks including voice phishing (vishing), subscription abuse, and organizational impersonation, with detection being especially challenging in low-resource languages such as Arabic. We present an Artificial Intelligence (AI)-based security architecture for fraud detection in Arabic cloud call centers, combining onboarding verification, behavioral monitoring, domain-adapted Automatic Speech Recognition (ASR), semantic transcript search, and Large Language Model (LLM)-based entity verification. The domain-adapted Langa ASR model achieves a Word Error Rate (WER) of 41.0% and Character Error Rate (CER) of 18.2%, outperforming all evaluated commercial baselines. LLM-based entity extraction with multi-call consensus achieves 97.3% company-name accuracy (Generative Pre-trained Transformer 4, GPT-4) and 92.0% in the cost-effective deployed configuration (GPT-3.5 with log-probability filtering). Evaluated on production data from a Middle East and North Africa (MENA)-region provider spanning more than 1000 accounts, the pipeline flagged 47 accounts of which 41 were confirmed fraudulent (directly observed precision 87.2%, 95% confidence interval (CI): 74.3–95.2%; estimated recall 51–82% under conservative base-rate assumptions—not directly measured), providing evidence for the viability of a unified, threat-model-driven architecture for low-resource telephony fraud detection. Full article
(This article belongs to the Special Issue AI-Enhanced Security: Advancing Threat Detection and Defense)
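The multi-call consensus step that lifts entity accuracy can be sketched as a majority vote over repeated extractions. A minimal stdlib sketch (the 0.6 agreement threshold is our illustrative choice, not the deployed configuration):

```python
from collections import Counter

def consensus_entity(extractions, min_agreement=0.6):
    """Sketch of multi-call consensus: the same entity (e.g. a company
    name) is extracted from several calls of the same account and
    accepted only when one value wins a clear majority; otherwise the
    account is left for manual review."""
    if not extractions:
        return None
    value, count = Counter(extractions).most_common(1)[0]
    return value if count / len(extractions) >= min_agreement else None
```

Voting across calls is what makes the pipeline tolerant of the high ASR error rate (41% WER): individual transcripts may garble the name, but the correct value tends to dominate across calls.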

14 pages, 1774 KB  
Article
Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study
by Hajime Ando, Ryutaro Matsugaki, Sakumi Yamakawa and Akira Ogami
Occup. Health 2026, 1(2), 16; https://doi.org/10.3390/occuphealth1020016 - 17 Apr 2026
Viewed by 366
Abstract
Same-level falls are the most frequent occupational accidents, yet traditional manual analysis of accident reports is labor-intensive and limits large-scale prevention strategies. In this pilot study, we aimed to evaluate the accuracy of using large language models (LLMs) to automate the classification of occupational accident text data without task-specific pretraining. We analyzed data from 2619 same-level-fall-related injury cases, using expert manual classification as the reference standard. Four models—GPT-4o mini, GPT-4.1 mini, GPT-4.1, and o4-mini—were compared using accuracy and Cohen’s kappa. The o4-mini model demonstrated the highest performance, showing statistical superiority in the complex “causal agent” category with 72.8% accuracy. For other classification tasks, the top models achieved accuracies of 82–92%, with Cohen’s kappa coefficients > 0.7, indicating substantial agreement with expert judgments. These findings suggest that LLMs can classify occupational accident text with substantial agreement with the expert-derived reference standard in this dataset. This automated approach enables efficient, high-frequency analysis of large datasets, offering a promising tool for large-scale occupational accident surveillance and screening. Full article
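The agreement statistic used here, Cohen's kappa, corrects raw accuracy for chance agreement. A self-contained sketch for two label sequences:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label lists: observed
    agreement minus chance agreement, normalised. Values above 0.7 are
    conventionally read as substantial agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1.0 - pe)                         # chance-corrected
```

This is why kappa > 0.7 alongside 82–92% accuracy is meaningful: accuracy alone can be inflated when one category dominates, which is common in accident-report data.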

21 pages, 12849 KB  
Article
VETA-CLIP: Lightweight Video Adaptation with Efficient Spatio-Temporal Attention and Variation Loss
by Jing Huang and Jiaxin Liao
Electronics 2026, 15(8), 1701; https://doi.org/10.3390/electronics15081701 - 17 Apr 2026
Viewed by 152
Abstract
Full fine-tuning of large-scale vision-language models for video action recognition incurs prohibitive computational cost and often degrades pre-trained spatial representations. To address this, we propose VETA-CLIP, a Video Efficient Temporal Adaptation framework that enhances temporal modeling while preserving cross-modal alignment. By incorporating lightweight adapters into a frozen backbone, VETA-CLIP introduces only 3.55M trainable parameters (a 98% reduction compared to full fine-tuning). Our approach features two key innovations: (1) an Efficient Spatio-Temporal Attention (ESTA) mechanism with a parameter-free boundary replication temporal shift (BRTS) module, which explicitly decouples spatial and temporal attention heads to capture inter-frame dynamics while minimizing disruption to the pre-trained spatial representations; and (2) a novel Variation Loss that maximizes both local inter-frame differences and global temporal variance, encouraging the model to focus on action-related changes rather than static backgrounds. Extensive experiments on HMDB-51, UCF-101, and Something-Something v2 demonstrate that VETA-CLIP achieves competitive performance across zero-shot, base-to-novel, and few-shot protocols, and remains competitive on the Kinetics-400 dataset. Notably, our eight-frame variant requires only 4.7 GB of peak GPU memory and 2.47 ms of inference per video, demonstrating exceptional computational efficiency alongside consistent accuracy gains. Full article
(This article belongs to the Section Artificial Intelligence)
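A parameter-free temporal shift with boundary replication can be sketched as follows. This is our reading of the BRTS idea, not the paper's code: a fraction of the channels look one frame back, a fraction one frame ahead, and boundary frames are replicated rather than zero-padded:

```python
import numpy as np

def boundary_replication_shift(x, fold=4):
    """Sketch of a boundary-replication temporal shift over a (T, C)
    feature map: 1/fold of the channels see the previous frame, 1/fold
    see the next frame, and the rest are untouched. Because copying is
    the only operation, the module adds no trainable parameters."""
    t, c = x.shape
    k = c // fold
    out = x.copy()
    out[1:, :k] = x[:-1, :k]             # backward shift; frame 0 keeps itself
    out[:-1, k:2 * k] = x[1:, k:2 * k]   # forward shift; last frame keeps itself
    return out
```

Mixing shifted and unshifted channels lets subsequent (frozen or lightly adapted) attention layers see inter-frame information at near-zero cost.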

20 pages, 10822 KB  
Article
T-CASP: Time-Aware Continual Aspect Semantic-Driven Incremental Pretraining
by Shuai Feng, Pan Su, Zaishan Qi and Liran Yang
Appl. Sci. 2026, 16(8), 3837; https://doi.org/10.3390/app16083837 - 15 Apr 2026
Viewed by 300
Abstract
With the rapid advancement of medical foundation models, their deployment in clinical practice is increasingly required. However, privacy constraints of hospital-specific data make large-scale retraining infeasible, limiting model adaptability. To address this issue, we propose a Continual Aspect Semantic-driven Incremental Pretraining (CASP) framework, which enables efficient adaptation of foundation models to private data so that the pre-trained models can be effectively applied to downstream tasks. In this paper, we focus on fundus fluorescein angiography (FFA) in ophthalmology as a representative application scenario to validate the proposed approach. FFA is a critical imaging modality for retinal disease diagnosis, as it is able to capture dynamic vascular changes across multiple angiographic phases. However, most existing learning-based methods treat FFA images statically and independently, failing to exploit the rich temporal and phase-specific semantics that are essential for accurate diagnosis. In this paper, a Time-aware Continual Aspect Semantic-driven incremental Pretraining (T-CASP) framework is proposed for FFA sequences. To compensate for limited temporal descriptions in clinical reports, large language models are first used to construct a temporal disease knowledge dictionary with phase-specific semantic descriptions. Based on this dictionary, a disease correlation matrix is injected into contrastive learning to guide fine-grained image–text alignment. A multi-layer gated residual adapter is further designed to capture phase-level semantics and enable knowledge transfer across learning stages through phase-wise continual pretraining. Extensive experiments demonstrate that T-CASP effectively models dynamic temporal semantics in FFA sequences, yielding consistent performance improvements over time-unaware and static baselines in retinal disease recognition. By explicitly integrating phase-wise temporal knowledge and continual semantic refinement, T-CASP provides a clinically consistent and effective solution for temporal FFA analysis, enhancing robustness and diagnostic accuracy in ophthalmic multimodal learning. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 5504 KB  
Article
A Large Language Model for Traffic Flow Prediction Based on Stationary Wavelet Transform and Graph Convolutional Networks
by Xin Wang, Gang Liu, Jing He, Xiangbing Zhou and Zhiyong Luo
ISPRS Int. J. Geo-Inf. 2026, 15(4), 166; https://doi.org/10.3390/ijgi15040166 - 11 Apr 2026
Viewed by 415
Abstract
With the rapid development of Intelligent Transportation Systems (ITSs), traffic prediction, a crucial component of ITSs, has garnered growing scholarly attention. The application of deep learning to traffic prediction has emerged as a prominent research direction, especially amid the rapid advancement of pretrained large language models (LLMs), which offer substantial benefits in time-series analysis through cross-modal knowledge transfer. In response to this advancement, this study introduces an innovative model for traffic flow prediction, designated as WGLLM. To capture spatiotemporal characteristics inherent in traffic flow data, this model incorporates a sequence embedding layer constructed on the stationary wavelet transform (SWT) and long short-term memory (LSTM), in conjunction with a spatial embedding layer founded on graph convolutional networks (GCNs). Additionally, a fully connected layer is utilized to integrate embeddings into the LLMs for comprehensive global dependency analysis. To verify the effectiveness of the proposed approach, experiments were carried out on two real traffic flow datasets. The experimental results demonstrate that WGLLM achieves superior predictive performance compared to multiple mainstream baseline models, accompanied by a significant enhancement in prediction accuracy. Full article
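The SWT stage of the sequence embedding keeps the series at full length while splitting it into trend and fluctuation bands, unlike the decimated DWT. A one-level Haar sketch of that property (circular boundary handling is our simplification; WGLLM's actual wavelet and depth are not specified here):

```python
import numpy as np

def haar_swt_level1(x):
    """One undecimated (stationary) Haar wavelet level: both outputs
    keep the input length, so the approximation (low-pass, trend) and
    detail (high-pass, fluctuation) bands stay aligned with the
    original time axis before being fed to the LSTM."""
    x = np.asarray(x, dtype=float)
    nxt = np.roll(x, -1)                 # neighbouring sample, wrapped
    approx = (x + nxt) / np.sqrt(2.0)    # low-pass / trend band
    detail = (x - nxt) / np.sqrt(2.0)    # high-pass / fluctuation band
    return approx, detail
```

Length preservation is the point: per-timestep wavelet features can be concatenated with the raw series and the GCN spatial embeddings without any resampling.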

18 pages, 439 KB  
Article
Understanding and Predicting Tourist Behavior Through Large Language Models
by Anna Dalla Vecchia, Simone Mattioli, Sara Migliorini and Elisa Quintarelli
Big Data Cogn. Comput. 2026, 10(4), 117; https://doi.org/10.3390/bdcc10040117 - 11 Apr 2026
Viewed by 411
Abstract
Understanding and predicting how tourists move through a city is a challenging task, as it involves a complex interplay of spatial, temporal, and social factors. Traditional recommender systems often rely on structured data, trying to capture the nature of the problem. However, recent advances in Large Language Models (LLMs) open new possibilities for reasoning over richer, text-based representations of user context, even without a dedicated pre-training phase. In this study, we investigate the potential of LLMs to interpret and predict tourist movements in a real-world application scenario involving tourist visits to Verona, a municipality in Northern Italy, between 2014 and 2023. We propose an incremental prompt engineering approach that gradually enriches the model input, from spatial features alone to richer behavioral information, including visit histories, time information, and user cluster patterns. The approach is evaluated using six open-source models, enabling us to compare their accuracy and efficiency across various levels of contextual enrichment. The results provide first insights into the ability of LLMs to incorporate spatio-temporal contextual factors, improving predictions while maintaining computational efficiency. The analysis of the model-generated explanations completes the picture by adding an interpretability dimension that most existing next-PoI prediction solutions lack. Overall, the study demonstrates the potential of LLMs to integrate multiple contextual dimensions in tourism mobility, highlighting the possibility of a more text-oriented, adaptive, and explainable T-RS. Full article
(This article belongs to the Section Large Language Models and Embodied Intelligence)

6 pages, 450 KB  
Proceeding Paper
Class Entity Identification Based on Large Language Models: A Choice Between Classification and Generation
by Eric Jui-Lin Lu and Cheng-Hao Yang
Eng. Proc. 2026, 134(1), 42; https://doi.org/10.3390/engproc2026134042 - 10 Apr 2026
Viewed by 238
Abstract
Large language models (LLMs) have been widely applied to knowledge graph question answering (KGQA) systems. Recent Text-to-SPARQL studies have demonstrated that generation performance can achieve an F1 score exceeding 90%. Further error analysis has categorized common errors into entity translation errors, entity position errors, and resource description framework (RDF) triple-count errors, with the latter accounting for 24% of all errors. Notably, nearly 90% of RDF triple-count errors occur when the triples involve class entities. Previous research has shown that incorporating prompts can effectively enhance model performance. Based on the results, we predicted whether a question contains a class entity and the number of RDF triples in the corresponding query to reduce RDF triple-count errors in large language models by providing precise task-related information through prompt design. Since both strategies are classification-oriented, two implementation paradigms were established: traditional classification architectures and generative modeling. They were compared in terms of performance. For classification-based architectures, we employed Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized BERT Approach (RoBERTa) to obtain question embeddings for classification. For the generative approach, we adopted the Instruction-Tuned Text-to-Text Transfer Transformer (Flan-T5). Experimental results show that the generative model slightly outperforms conventional classification architectures, indicating that generative approaches can achieve higher prediction accuracy and provide more reliable information without the need for additional complex encoder designs, thereby improving the overall quality of Text-to-SPARQL generation. Full article

19 pages, 9603 KB  
Article
Understanding Modality-Specific Vulnerabilities in Vision–Language Models Under Adversarial Attacks
by Maisha Binte Rashid and Pablo Rivas
AI 2026, 7(4), 135; https://doi.org/10.3390/ai7040135 - 9 Apr 2026
Viewed by 472
Abstract
Vision–language models (VLMs), such as Contrastive Language–Image Pretraining (CLIP), are increasingly deployed in real-world applications, including content moderation, misinformation detection, and fraud analysis, making their robustness to adversarial attacks a critical concern. While adversarial robustness has been widely studied in unimodal models, modality-specific vulnerabilities in multimodal models remain underexplored. In this work, we analyze CLIP by applying gradient-based adversarial attacks to its vision and language modalities, both independently and jointly, and evaluating performance on two multimodal classification benchmarks: the Facebook Hateful Memes dataset and a large-scale Suspicious Car Parts dataset. Using Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks along with multiple adversarial retraining strategies, we show that adversarial perturbations on the image modality consistently cause the most severe and unstable performance degradation. These results demonstrate that the vision modality is the primary vulnerability in CLIP, highlighting the need for modality-specific defense strategies that focus more on the weaker modality in multimodal systems. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
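The simpler of the two attacks used here, FGSM, is a single signed-gradient step. A minimal sketch; `grad` is supplied directly, whereas in the paper it would come from back-propagating the task loss through CLIP's image (or text-embedding) encoder:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """One Fast Gradient Sign Method step: move every input dimension
    by eps in the sign direction of the loss gradient, then clip back
    to the valid pixel range [0, 1]. PGD iterates this step with an
    additional projection onto the eps-ball."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

Applying this step to the image input only, the text input only, or both is exactly the modality-ablation protocol the abstract describes.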

26 pages, 2454 KB  
Article
Evolving LLMs from Next-Token Prediction to Multi-Token Prediction via Self-Distillation
by Yang Xu and Wanxiang Che
Electronics 2026, 15(7), 1533; https://doi.org/10.3390/electronics15071533 - 6 Apr 2026
Viewed by 689
Abstract
Mainstream Large Language Models (LLMs) work under the paradigm of Next-Token Prediction (NTP). Multi-Token Prediction (MTP) is motivated by higher decoding efficiency, extending NTP to enable LLMs to draft multiple tokens during each forward pass. However, existing MTP approaches pretrain MTP along with the target LLM, making it difficult to unlock MTP for LLMs without official support. In this work, we propose a post-hoc approach to training an MTP module for a target LLM, providing an efficient way to evolve the LLM from NTP to MTP. The proposed approach features two main characteristics. (1) No changes to the target LLM, since it is frozen during MTP training. (2) Efficient MTP training via self-distillation from the target LLM’s native NTP capability. Results show that our approach can post-hoc train a performant MTP module via lightweight pretraining. Full article
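The self-distillation objective can be sketched as cross-entropy against the frozen teacher's own next-token distributions. This is our illustration of the idea in numpy (shapes and function names are ours, not the paper's):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(student_logits, teacher_logits):
    """Sketch of the self-distillation objective: next-token
    distributions from the frozen target LLM (rolled out token by
    token) act as soft labels for the MTP module's multi-token draft.
    Cross-entropy against the teacher softmax, averaged over draft
    positions. Shapes: (positions, vocab)."""
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())
```

Because only the MTP module receives gradients, the target LLM's weights (and its NTP behaviour) are untouched, which is the "no changes to the target LLM" property claimed above.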

15 pages, 1901 KB  
Article
DW-ReID: Vision–Language Learning for Person Re-Identification Under Diverse Weather Conditions
by Lei Cai, Yuying Liang, Bin Wang, Hexi Li, Jinquan Yang and Tao Zhu
Sensors 2026, 26(7), 2263; https://doi.org/10.3390/s26072263 - 6 Apr 2026
Viewed by 561
Abstract
Person re-identification (ReID) under diverse weather conditions remains a critical yet insufficiently explored problem. Most existing ReID approaches are developed and benchmarked on clear-weather datasets, resulting in significant performance degradation when deployed in rainy, snowy, or hazy environments. Conventional image restoration methods, typically optimized for low-level image quality metrics, are often misaligned with the objectives of high-level identity discrimination and thus fail to improve the person ReID performance. To address these limitations, we propose DW-ReID, a unified framework that integrates weather-degraded image restoration with person re-identification tasks. The proposed DW-ReID is built upon a large-scale Contrastive Language-Image Pre-training (CLIP) model and trained with a two-stage paradigm. In the first stage, a set of learnable text prompts is optimized to construct identity-specific ambiguous descriptions for each person’s identity. In the second stage, the optimized text descriptions, together with a frozen text encoder, provide language supervision to jointly train a weather encoder, an image restorer, and a ReID encoder in an end-to-end manner. The experimental results on our two contributed synthetic datasets consistently demonstrate the effectiveness and superior performance of the proposed DW-ReID method. Full article
