Search Results (460)

Search Parameters:
Keywords = pretrained large language models

24 pages, 453 KB  
Article
Reason2Decide-C: Adaptive Cycle-Consistent Training for Clinical Rationales
by H M Quamran Hasan, Housam Khalifa Bashier Babiker, Mi-Young Kim and Randy Goebel
Computers 2026, 15(5), 279; https://doi.org/10.3390/computers15050279 - 27 Apr 2026
Abstract
Large Language Models (LLMs) used for clinical decision support must not only make accurate predictions but also generate rationales that are consistent with, and sufficient for, those predictions. Building on Reason2Decide, a two-stage rationale-driven multi-task framework, we propose Reason2Decide-C (R2D-C, where C denotes cycle consistency), which augments Reason2Decide’s stage 2 training with confidence-adaptive scheduled sampling and cycle-consistent rationale-to-label training. In stage 1, we pretrain our model on rationale generation. In stage 2, we jointly train on label prediction and rationale generation, gradually replacing gold labels with model-predicted labels based on confidence. Simultaneously, we feed the rationale logits back into the model to recover the label, thus enforcing explanation sufficiency. We evaluate R2D-C on one proprietary triage dataset, as well as public biomedical QA and reasoning datasets. Across model sizes, R2D-C substantially improves rationale–prediction consistency (where stage 1 and stage 2 predictions agree) and sufficiency (where the rationale alone recovers the ground-truth label) over other baselines while matching or modestly improving predictive performance (F1); in several settings R2D-C surpasses 40× larger foundation models. Ablations confirm that the full combination is optimal, maximizing alignment and LLM-as-a-Judge rationale quality. These results demonstrate that confidence-adaptive scheduled sampling and cycle-consistent rationale-to-label training substantially enhance explanation alignment without sacrificing accuracy. Full article
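The two stage-2 ingredients the abstract describes can be sketched in a few lines. This is our illustration, not the authors' code: the threshold and loss weights are placeholder values, and the function names are ours.

```python
def stage2_label(gold, predicted, confidence, threshold=0.9):
    """Confidence-adaptive scheduled sampling (sketch): once the model
    is confident enough in its own prediction, that prediction replaces
    the gold label in the stage-2 training input; `threshold` is an
    illustrative value, not a reported hyperparameter."""
    return predicted if confidence >= threshold else gold

def r2dc_objective(label_loss, rationale_loss, cycle_loss,
                   alpha=1.0, beta=1.0, gamma=0.5):
    """Hypothetical weighted combination of the three stage-2 terms:
    label prediction, rationale generation, and the cycle term that
    recovers the label from the rationale logits. Weights are
    placeholders."""
    return alpha * label_loss + beta * rationale_loss + gamma * cycle_loss
```

Early in training, low confidence keeps the gold label in the loop; as confidence grows, the model increasingly conditions on its own predictions, which is what makes the rationale–prediction cycle consistent at inference time.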

26 pages, 1714 KB  
Article
SV-GEN: Synergizing LLM-Empowered Variable Semantics and Graph Transformers for Vulnerability Detection
by Zhaohui Liu, Haocheng Yang and Wenjie Xie
Future Internet 2026, 18(5), 236; https://doi.org/10.3390/fi18050236 - 27 Apr 2026
Abstract
Deep-learning-based vulnerability detection has made substantial progress, but two limitations remain prominent. Sequence-based methods linearize source code and thus weaken the explicit modeling of control-flow and data-flow dependencies. Graph-based methods preserve program structure, yet conventional graph neural networks still have difficulty capturing long-range interactions in large code property graphs (CPGs). In addition, standard CPGs usually lack explicit variable semantics and security-critical node roles, which limits their ability to represent vulnerability-relevant program behavior. To address these issues, we propose SV-GEN, a vulnerability detection framework that combines large-language-model-driven semantic enhancement with hybrid sequence-graph learning. The novelty of SV-GEN lies in introducing a semantically enriched code property graph, termed Sem-CPG, which augments conventional CPGs with variable semantic roles and security-oriented node labels, and in coupling this representation with an adaptive fusion mechanism over structural and sequential views. Specifically, we use a large language model as an external semantic annotator to assign variable roles and identify source, sink, and sanitizer nodes, and then encode the resulting Sem-CPG with a Graph Transformer while modeling the code sequence with GraphCodeBERT. A learnable gating module is further used to adaptively fuse the graph-level and sequence-level representations for final prediction. Experiments on Devign, ReVeal, and DiverseVul show that SV-GEN achieves competitive or superior overall performance across benchmarks, with particularly strong improvements on the large and highly imbalanced DiverseVul dataset. Full article
(This article belongs to the Special Issue Security of Computer System and Network)
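The adaptive fusion over structural and sequential views can be pictured as a scalar sigmoid gate. A minimal numpy sketch, assuming a learned gate weight vector `w` (the names and the scalar-gate form are our simplification of SV-GEN's gating module):

```python
import numpy as np

def gated_fusion(h_graph, h_seq, w, b=0.0):
    """Sketch of an adaptive gating module: a gate computed from both
    views decides how much of the Graph Transformer (Sem-CPG) embedding
    versus the GraphCodeBERT sequence embedding enters the prediction."""
    z = np.concatenate([h_graph, h_seq]) @ w + b   # gate pre-activation
    g = 1.0 / (1.0 + np.exp(-z))                   # sigmoid gate in (0, 1)
    return g * h_graph + (1.0 - g) * h_seq
```

With zero weights the gate sits at 0.5 and the two views are averaged; training moves the gate toward whichever view is more informative for a given function.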

19 pages, 870 KB  
Article
Integrating Unsupervised Learning for the Factual Consistency of Generative Models
by Sindhu Nair and Y. S. Rao
Future Internet 2026, 18(5), 235; https://doi.org/10.3390/fi18050235 - 27 Apr 2026
Abstract
Text summarization involves analyzing large amounts of text, selecting the salient text features, and arranging them coherently. The graph-based TextRank and statistical topic modeling are unsupervised approaches for generating an extractive synopsis. Deep learning models are supervised, data-driven, and pre-trained on a huge corpus of data, making a significant contribution to automatic text summarization systems. Despite grammatical correctness and coherence, deep learning-based summarization systems are prone to factual inconsistency. This has hindered the applicability of transformer-based summarizers, particularly in critical domains where misleading summarization systems can lead to severe consequences due to their significant social impact. This work proposes an ingenious hybrid hierarchical approach that combines unsupervised approaches, such as the TextRank algorithm and Latent Dirichlet Allocation (LDA)-based summaries, with contemporary transformer-based language models. When validated on three benchmark summarization datasets, empirical results prove that our hybrid hierarchical transformer-based approach mitigates the factual inconsistency problem inherent in abstractive summarization. The improved summary consistency score of the abstractive summaries generated with our multilevel hybrid approach, in comparison to the fine-tuned baseline transformer-based language models, increases trust in transformer-based summarizers. Full article
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))
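The unsupervised extractive stage rests on TextRank, i.e. PageRank over a sentence-similarity graph. A minimal sketch (not the paper's implementation) that takes a precomputed pairwise similarity matrix:

```python
import numpy as np

def textrank_scores(sim, d=0.85, iters=50):
    """Minimal TextRank: row-normalise the sentence-similarity matrix
    into transition probabilities, then run the PageRank iteration with
    damping factor d. Higher-scoring sentences go into the extractive
    synopsis."""
    n = sim.shape[0]
    p = sim / sim.sum(axis=1, keepdims=True)       # row-stochastic transitions
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1.0 - d) / n + d * (p.T @ scores)
    return scores
```

In the hybrid pipeline described above, the top-ranked sentences (together with LDA topic summaries) would then condition the transformer-based abstractive model, anchoring it to source content.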

35 pages, 13771 KB  
Article
BioLAMR: A Biomimetically Inspired Large Language Model Adaptation Framework for Automatic Modulation Recognition
by Yubo Mao, Wei Xu, Jijia Sang and Haoan Liu
Biomimetics 2026, 11(4), 288; https://doi.org/10.3390/biomimetics11040288 - 21 Apr 2026
Viewed by 331
Abstract
Automatic modulation recognition (AMR) is increasingly relevant to communication-sensing front ends in robotic and human–robot collaborative systems, where reliable spectrum awareness and adaptive wireless reception are desired. However, existing methods often degrade sharply at low signal-to-noise ratios (SNRs), and large language models (LLMs) are not natively compatible with continuous I/Q signals due to the inherent modality gap. We propose BioLAMR, a GPT-2 adaptation framework for AMR inspired by the auditory system’s parallel time–frequency processing and cortical hierarchy. The framework combines bio-inspired dual-domain feature extraction with parameter-efficient LLM adaptation. BioLAMR includes three components. First, a lightweight dual-domain fusion (LDDF) module extracts complementary time- and frequency-domain features and fuses them through channel and spatial attention. Second, a convolutional embedding module converts continuous I/Q signals into GPT-2-compatible sequences without discrete tokenization. Third, a hierarchical fine-tuning strategy updates only 8.9% of parameters to preserve pretrained knowledge while adapting to modulation recognition. Experiments on the RadioML2016.10a and RadioML2016.10b benchmarks show that BioLAMR achieves overall accuracies of 64.99% and 67.43%, outperforming the strongest competing method by 2.60 and 2.47 percentage points, respectively. Under low-SNR conditions, it reaches 36.78% and 38.14%, the best results among the compared methods. Ablation studies verify the contribution of each component. These results demonstrate that combining dual-domain signal modeling with parameter-efficient GPT-2 adaptation is an effective route to robust AMR in challenging wireless environments. Full article
(This article belongs to the Section Locomotion and Bioinspired Robotics)
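The key bridging step, turning continuous I/Q samples into GPT-2-compatible tokens without discrete tokenization, can be sketched as patch-and-project. This is our illustration of the idea, with a random projection standing in for BioLAMR's learned convolution weights:

```python
import numpy as np

def iq_to_tokens(iq, patch=16, d_model=8, seed=0):
    """Sketch of a convolutional-style embedding: slice a continuous
    2xL I/Q signal into non-overlapping patches and project each patch
    to a d_model-dimensional vector, yielding a token sequence a
    decoder-only LLM can consume. Patch size and width are illustrative."""
    rng = np.random.default_rng(seed)
    c, length = iq.shape                            # c == 2 (I and Q channels)
    n = length // patch
    patches = iq[:, :n * patch].reshape(c, n, patch)
    flat = patches.transpose(1, 0, 2).reshape(n, c * patch)
    w = rng.standard_normal((c * patch, d_model))   # stand-in for learned kernel
    return flat @ w                                 # (n_tokens, d_model)
```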

26 pages, 31446 KB  
Article
A Training-Free Paradigm for Data-Scarce Maritime Scene Classification Using Vision-Language Models
by Jiabao Wu, Yujie Chen, Wentao Chen, Yicheng Lai, Junjun Li, Xuhang Chen and Wangyu Wu
Sensors 2026, 26(8), 2549; https://doi.org/10.3390/s26082549 - 21 Apr 2026
Viewed by 235
Abstract
Maritime Domain Awareness (MDA) relies heavily on data acquired from high-resolution optical spaceborne sensors; however, processing this massive quantity of sensor data via traditional supervised deep learning is severely bottlenecked by its dependency on exhaustively annotated datasets. Under extreme data scarcity, conventional architectures suffer severe performance degradation, rendering them impractical for time-critical, zero-day deployments. To overcome this barrier, we propose a training-free inference paradigm that leverages the extensive pre-trained knowledge of Large Vision-Language Models (VLMs). Specifically, we introduce a Domain Knowledge-Enhanced In-Context Learning (DK-ICL) framework coupled with a Macro-Topological Chain-of-Thought (MT-CoT) strategy. This approach bridges the perspective gap between natural images and top–down optical sensor imagery by translating expert remote sensing heuristics into a strict, step-by-step reasoning pipeline. Extensive evaluations demonstrate the substantial efficacy of this framework. Armed with merely 4 visual exemplars per category as in-context triggers, our MT-CoT augmented VLMs outperform traditional models trained under identical scarcity by over 38% in F1-score. Crucially, real-world case studies confirm that this zero-gradient approach maintains robust generalization on unannotated, out-of-distribution coastal clutters, achieving performance parity with data-heavy networks trained on 50 times the data volume. By substituting massive human annotation and GPU optimization with scalable logical deduction, this paradigm establishes a resource-efficient foundation for next-generation intelligent maritime sensing networks. Full article
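A training-free pipeline of this kind is ultimately a prompt-construction exercise. The sketch below shows the shape of a DK-ICL prompt with a macro-topological chain of thought; the reasoning steps and field labels are our paraphrase, not the paper's actual prompt text:

```python
def build_mtcot_prompt(exemplars, query_desc):
    """Illustrative prompt builder: a few labelled in-context exemplars
    plus an explicit step-by-step instruction encoding remote-sensing
    heuristics (macro layout first, topology second, category last)."""
    steps = ("Reason step by step: (1) describe the macro layout of the "
             "scene from the top-down view; (2) check coastline, wake and "
             "berth topology; (3) only then assign the maritime category.")
    shots = "\n".join(f"Image: {img} -> Category: {lab}"
                      for img, lab in exemplars)
    return f"{steps}\n{shots}\nImage: {query_desc} -> Category:"
```

With four exemplars per category, as in the paper's setup, the whole "training" step reduces to assembling such a prompt per query.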

29 pages, 417 KB  
Article
An AI-Based Security Architecture for Fraud Detection in Cloud Call Centers for Low-Resource Languages: Arabic as a Use Case
by Pinar Boluk and Hana’a Maratouq
Electronics 2026, 15(8), 1718; https://doi.org/10.3390/electronics15081718 - 18 Apr 2026
Viewed by 161
Abstract
Cloud-based telephony platforms face growing fraud risks including voice phishing (vishing), subscription abuse, and organizational impersonation, with detection being especially challenging in low-resource languages such as Arabic. We present an Artificial Intelligence (AI)-based security architecture for fraud detection in Arabic cloud call centers, combining onboarding verification, behavioral monitoring, domain-adapted Automatic Speech Recognition (ASR), semantic transcript search, and Large Language Model (LLM)-based entity verification. The domain-adapted Langa ASR model achieves a Word Error Rate (WER) of 41.0% and Character Error Rate (CER) of 18.2%, outperforming all evaluated commercial baselines. LLM-based entity extraction with multi-call consensus achieves 97.3% company-name accuracy (Generative Pre-trained Transformer 4, GPT-4) and 92.0% in the cost-effective deployed configuration (GPT-3.5 with log-probability filtering). Evaluated on production data from a Middle East and North Africa (MENA)-region provider spanning more than 1000 accounts, the pipeline flagged 47 accounts of which 41 were confirmed fraudulent (directly observed precision 87.2%, 95% confidence interval (CI): 74.3–95.2%; estimated recall 51–82% under conservative base-rate assumptions—not directly measured), providing evidence for the viability of a unified, threat-model-driven architecture for low-resource telephony fraud detection. Full article
(This article belongs to the Special Issue AI-Enhanced Security: Advancing Threat Detection and Defense)
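The multi-call consensus step that lifts entity accuracy can be sketched as a majority vote over repeated extractions. A minimal stdlib sketch (the 0.6 agreement threshold is our illustrative choice, not the deployed configuration):

```python
from collections import Counter

def consensus_entity(extractions, min_agreement=0.6):
    """Sketch of multi-call consensus: the same entity (e.g. a company
    name) is extracted from several calls of the same account and
    accepted only when one value wins a clear majority; otherwise the
    account is left for manual review."""
    if not extractions:
        return None
    value, count = Counter(extractions).most_common(1)[0]
    return value if count / len(extractions) >= min_agreement else None
```

Voting across calls is what makes the pipeline tolerant of the high ASR error rate (41% WER): individual transcripts may garble the name, but the correct value tends to dominate across calls.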

14 pages, 1774 KB  
Article
Automated Classification of Occupational Accident Texts Using Large Language Models: A Pilot Study
by Hajime Ando, Ryutaro Matsugaki, Sakumi Yamakawa and Akira Ogami
Occup. Health 2026, 1(2), 16; https://doi.org/10.3390/occuphealth1020016 - 17 Apr 2026
Viewed by 366
Abstract
Same-level falls are the most frequent occupational accidents, yet traditional manual analysis of accident reports is labor-intensive and limits large-scale prevention strategies. In this pilot study, we aimed to evaluate the accuracy of using large language models (LLMs) to automate the classification of occupational accident text data without task-specific pretraining. We analyzed data from 2619 same-level-fall-related injury cases, using expert manual classification as the reference standard. Four models—GPT-4o mini, GPT-4.1 mini, GPT-4.1, and o4-mini—were compared using accuracy and Cohen’s kappa. The o4-mini model demonstrated the highest performance, showing statistical superiority in the complex “causal agent” category with 72.8% accuracy. For other classification tasks, the top models achieved accuracies of 82–92%, with Cohen’s kappa coefficients > 0.7, indicating substantial agreement with expert judgments. These findings suggest that LLMs can classify occupational accident text with substantial agreement with the expert-derived reference standard in this dataset. This automated approach enables efficient, high-frequency analysis of large datasets, offering a promising tool for large-scale occupational accident surveillance and screening. Full article
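The agreement statistic used here, Cohen's kappa, corrects raw accuracy for chance agreement. A self-contained sketch for two label sequences:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label lists: observed
    agreement minus chance agreement, normalised. Values above 0.7 are
    conventionally read as substantial agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1.0 - pe)                         # chance-corrected
```

This is why kappa > 0.7 alongside 82–92% accuracy is meaningful: accuracy alone can be inflated when one category dominates, which is common in accident-report data.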

21 pages, 12849 KB  
Article
VETA-CLIP: Lightweight Video Adaptation with Efficient Spatio-Temporal Attention and Variation Loss
by Jing Huang and Jiaxin Liao
Electronics 2026, 15(8), 1701; https://doi.org/10.3390/electronics15081701 - 17 Apr 2026
Viewed by 152
Abstract
Full fine-tuning of large-scale vision-language models for video action recognition incurs prohibitive computational cost and often degrades pre-trained spatial representations. To address this, we propose VETA-CLIP, a Video Efficient Temporal Adaptation framework that enhances temporal modeling while preserving cross-modal alignment. By incorporating lightweight adapters into a frozen backbone, VETA-CLIP introduces only 3.55M trainable parameters (a 98% reduction compared to full fine-tuning). Our approach features two key innovations: (1) an Efficient Spatio-Temporal Attention (ESTA) mechanism with a parameter-free boundary replication temporal shift (BRTS) module, which explicitly decouples spatial and temporal attention heads to capture inter-frame dynamics while minimizing disruption to the pre-trained spatial representations; and (2) a novel Variation Loss that maximizes both local inter-frame differences and global temporal variance, encouraging the model to focus on action-related changes rather than static backgrounds. Extensive experiments on HMDB-51, UCF-101, and Something-Something v2 demonstrate that VETA-CLIP achieves competitive performance across zero-shot, base-to-novel, and few-shot protocols, and remains competitive on the Kinetics-400 dataset. Notably, our eight-frame variant requires only 4.7 GB of peak GPU memory and 2.47 ms of inference per video, demonstrating exceptional computational efficiency alongside consistent accuracy gains. Full article
(This article belongs to the Section Artificial Intelligence)
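A parameter-free temporal shift with boundary replication can be sketched as follows. This is our reading of the BRTS idea, not the paper's code: a fraction of the channels look one frame back, a fraction one frame ahead, and boundary frames are replicated rather than zero-padded:

```python
import numpy as np

def boundary_replication_shift(x, fold=4):
    """Sketch of a boundary-replication temporal shift over a (T, C)
    feature map: 1/fold of the channels see the previous frame, 1/fold
    see the next frame, and the rest are untouched. Because copying is
    the only operation, the module adds no trainable parameters."""
    t, c = x.shape
    k = c // fold
    out = x.copy()
    out[1:, :k] = x[:-1, :k]             # backward shift; frame 0 keeps itself
    out[:-1, k:2 * k] = x[1:, k:2 * k]   # forward shift; last frame keeps itself
    return out
```

Mixing shifted and unshifted channels lets subsequent (frozen or lightly adapted) attention layers see inter-frame information at near-zero cost.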

20 pages, 10822 KB  
Article
T-CASP: Time-Aware Continual Aspect Semantic-Driven Incremental Pretraining
by Shuai Feng, Pan Su, Zaishan Qi and Liran Yang
Appl. Sci. 2026, 16(8), 3837; https://doi.org/10.3390/app16083837 - 15 Apr 2026
Viewed by 300
Abstract
With the rapid advancement of medical foundation models, their deployment in clinical practice is increasingly required. However, privacy constraints of hospital-specific data make large-scale retraining infeasible, limiting model adaptability. To address this issue, we propose a Continual Aspect Semantic-driven Incremental Pretraining (CASP) framework, which enables efficient adaptation of foundation models to private data so that the pre-trained models can be effectively applied to downstream tasks. In this paper, we focus on fundus fluorescein angiography (FFA) in ophthalmology as a representative application scenario to validate the proposed approach. FFA is a critical imaging modality for retinal disease diagnosis, as it is able to capture dynamic vascular changes across multiple angiographic phases. However, most existing learning-based methods treat FFA images statically and independently, failing to exploit the rich temporal and phase-specific semantics that are essential for accurate diagnosis. In this paper, a Time-aware Continual Aspect Semantic-driven incremental Pretraining (T-CASP) framework is proposed for FFA sequences. To compensate for limited temporal descriptions in clinical reports, large language models are first used to construct a temporal disease knowledge dictionary with phase-specific semantic descriptions. Based on this dictionary, a disease correlation matrix is injected into contrastive learning to guide fine-grained image–text alignment. A multi-layer gated residual adapter is further designed to capture phase-level semantics and enable knowledge transfer across learning stages through phase-wise continual pretraining. Extensive experiments demonstrate that T-CASP effectively models dynamic temporal semantics in FFA sequences, yielding consistent performance improvements over time-unaware and static baselines in retinal disease recognition. By explicitly integrating phase-wise temporal knowledge and continual semantic refinement, T-CASP provides a clinically consistent and effective solution for temporal FFA analysis, enhancing robustness and diagnostic accuracy in ophthalmic multimodal learning. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 5504 KB  
Article
A Large Language Model for Traffic Flow Prediction Based on Stationary Wavelet Transform and Graph Convolutional Networks
by Xin Wang, Gang Liu, Jing He, Xiangbing Zhou and Zhiyong Luo
ISPRS Int. J. Geo-Inf. 2026, 15(4), 166; https://doi.org/10.3390/ijgi15040166 - 11 Apr 2026
Viewed by 415
Abstract
With the rapid development of Intelligent Transportation Systems (ITSs), traffic prediction, a crucial component of ITSs, has garnered growing scholarly attention. The application of deep learning to traffic prediction has emerged as a prominent research direction, especially amid the rapid advancement of pretrained large language models (LLMs), which offer substantial benefits in time-series analysis through cross-modal knowledge transfer. In response to this advancement, this study introduces an innovative model for traffic flow prediction, designated as WGLLM. To capture spatiotemporal characteristics inherent in traffic flow data, this model incorporates a sequence embedding layer constructed on the stationary wavelet transform (SWT) and long short-term memory (LSTM), in conjunction with a spatial embedding layer founded on graph convolutional networks (GCNs). Additionally, a fully connected layer is utilized to integrate embeddings into the LLMs for comprehensive global dependency analysis. To verify the effectiveness of the proposed approach, experiments were carried out on two real traffic flow datasets. The experimental results demonstrate that WGLLM achieves superior predictive performance compared to multiple mainstream baseline models, accompanied by a significant enhancement in prediction accuracy. Full article
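The SWT stage of the sequence embedding keeps the series at full length while splitting it into trend and fluctuation bands, unlike the decimated DWT. A one-level Haar sketch of that property (circular boundary handling is our simplification; WGLLM's actual wavelet and depth are not specified here):

```python
import numpy as np

def haar_swt_level1(x):
    """One undecimated (stationary) Haar wavelet level: both outputs
    keep the input length, so the approximation (low-pass, trend) and
    detail (high-pass, fluctuation) bands stay aligned with the
    original time axis before being fed to the LSTM."""
    x = np.asarray(x, dtype=float)
    nxt = np.roll(x, -1)                 # neighbouring sample, wrapped
    approx = (x + nxt) / np.sqrt(2.0)    # low-pass / trend band
    detail = (x - nxt) / np.sqrt(2.0)    # high-pass / fluctuation band
    return approx, detail
```

Length preservation is the point: per-timestep wavelet features can be concatenated with the raw series and the GCN spatial embeddings without any resampling.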

18 pages, 439 KB  
Article
Understanding and Predicting Tourist Behavior Through Large Language Models
by Anna Dalla Vecchia, Simone Mattioli, Sara Migliorini and Elisa Quintarelli
Big Data Cogn. Comput. 2026, 10(4), 117; https://doi.org/10.3390/bdcc10040117 - 11 Apr 2026
Viewed by 411
Abstract
Understanding and predicting how tourists move through a city is a challenging task, as it involves a complex interplay of spatial, temporal, and social factors. Traditional recommender systems often rely on structured data, trying to capture the nature of the problem. However, recent advances in Large Language Models (LLMs) open new possibilities for reasoning over richer, text-based representations of user context, even without a dedicated pre-training phase. In this study, we investigate the potential of LLMs to interpret and predict tourist movements in a real-world application scenario involving tourist visits to Verona, a municipality in Northern Italy, between 2014 and 2023. We propose an incremental prompt engineering approach that gradually enriches the model input, from spatial features alone to richer behavioral information, including visit histories, time information, and user cluster patterns. The approach is evaluated using six open-source models, enabling us to compare their accuracy and efficiency across various levels of contextual enrichment. The results provide first insights into the ability of LLMs to incorporate spatio-temporal contextual factors, improving predictions while maintaining computational efficiency. The analysis of the model-generated explanations completes the picture by adding an interpretability dimension that most existing next-PoI prediction solutions lack. Overall, the study demonstrates the potential of LLMs to integrate multiple contextual dimensions in tourism mobility, highlighting the possibility of a more text-oriented, adaptive, and explainable T-RS. Full article
(This article belongs to the Section Large Language Models and Embodied Intelligence)

6 pages, 450 KB  
Proceeding Paper
Class Entity Identification Based on Large Language Models: A Choice Between Classification and Generation
by Eric Jui-Lin Lu and Cheng-Hao Yang
Eng. Proc. 2026, 134(1), 42; https://doi.org/10.3390/engproc2026134042 - 10 Apr 2026
Viewed by 238
Abstract
Large language models (LLMs) have been widely applied to knowledge graph question answering (KGQA) systems. Recent Text-to-SPARQL studies have demonstrated that generation performance can achieve an F1 score exceeding 90%. Further error analysis has categorized common errors into entity translation errors, entity position errors, and resource description framework (RDF) triple-count errors, with the latter accounting for 24% of all errors. Notably, nearly 90% of RDF triple-count errors occur when the triples involve class entities. Previous research has shown that incorporating prompts can effectively enhance model performance. Based on the results, we predicted whether a question contains a class entity and the number of RDF triples in the corresponding query to reduce RDF triple-count errors in large language models by providing precise task-related information through prompt design. Since both strategies are classification-oriented, two implementation paradigms were established: traditional classification architectures and generative modeling. They were compared in terms of performance. For classification-based architectures, we employed Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimized BERT Approach (RoBERTa) to obtain question embeddings for classification. For the generative approach, we adopted the Instruction-Tuned Text-to-Text Transfer Transformer (Flan-T5). Experimental results show that the generative model slightly outperforms conventional classification architectures, indicating that generative approaches can achieve higher prediction accuracy and provide more reliable information without the need for additional complex encoder designs, thereby improving the overall quality of Text-to-SPARQL generation. Full article

19 pages, 9603 KB  
Article
Understanding Modality-Specific Vulnerabilities in Vision–Language Models Under Adversarial Attacks
by Maisha Binte Rashid and Pablo Rivas
AI 2026, 7(4), 135; https://doi.org/10.3390/ai7040135 - 9 Apr 2026
Viewed by 472
Abstract
Vision–language models (VLMs), such as Contrastive Language–Image Pretraining (CLIP), are increasingly deployed in real-world applications, including content moderation, misinformation detection, and fraud analysis, making their robustness to adversarial attacks a critical concern. While adversarial robustness has been widely studied in unimodal models, modality-specific vulnerabilities in multimodal models remain underexplored. In this work, we analyze CLIP by applying gradient-based adversarial attacks to its vision and language modalities, both independently and jointly, and evaluating performance on two multimodal classification benchmarks: the Facebook Hateful Memes dataset and a large-scale Suspicious Car Parts dataset. Using Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks along with multiple adversarial retraining strategies, we show that adversarial perturbations on the image modality consistently cause the most severe and unstable performance degradation. These results demonstrate that the vision modality is the primary vulnerability in CLIP, highlighting the need for modality-specific defense strategies that focus more on the weaker modality in multimodal systems. Full article
(This article belongs to the Section AI Systems: Theory and Applications)
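The simpler of the two attacks used here, FGSM, is a single signed-gradient step. A minimal sketch; `grad` is supplied directly, whereas in the paper it would come from back-propagating the task loss through CLIP's image (or text-embedding) encoder:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.03):
    """One Fast Gradient Sign Method step: move every input dimension
    by eps in the sign direction of the loss gradient, then clip back
    to the valid pixel range [0, 1]. PGD iterates this step with an
    additional projection onto the eps-ball."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)
```

Applying this step to the image input only, the text input only, or both is exactly the modality-ablation protocol the abstract describes.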

26 pages, 2454 KB  
Article
Evolving LLMs from Next-Token Prediction to Multi-Token Prediction via Self-Distillation
by Yang Xu and Wanxiang Che
Electronics 2026, 15(7), 1533; https://doi.org/10.3390/electronics15071533 - 6 Apr 2026
Viewed by 689
Abstract
Mainstream Large Language Models (LLMs) work under the paradigm of Next-Token Prediction (NTP). Multi-Token Prediction (MTP) is motivated by higher decoding efficiency, extending NTP to enable LLMs to draft multiple tokens during each forward pass. However, existing MTP approaches pretrain MTP along with the target LLM, making it difficult to unlock MTP for LLMs without official support. In this work, we propose a post-hoc approach to training an MTP module for a target LLM, providing an efficient way to evolve the LLM from NTP to MTP. The proposed approach features two main characteristics. (1) No changes to the target LLM, since it is frozen during MTP training. (2) Efficient MTP training via self-distillation from the target LLM’s native NTP capability. Results show that our approach can post-hoc train a performant MTP module via lightweight pretraining. Full article
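The self-distillation objective can be sketched as cross-entropy against the frozen teacher's own next-token distributions. This is our illustration of the idea in numpy (shapes and function names are ours, not the paper's):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distill_loss(student_logits, teacher_logits):
    """Sketch of the self-distillation objective: next-token
    distributions from the frozen target LLM (rolled out token by
    token) act as soft labels for the MTP module's multi-token draft.
    Cross-entropy against the teacher softmax, averaged over draft
    positions. Shapes: (positions, vocab)."""
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())
```

Because only the MTP module receives gradients, the target LLM's weights (and its NTP behaviour) are untouched, which is the "no changes to the target LLM" property claimed above.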

15 pages, 1901 KB  
Article
DW-ReID: Vision–Language Learning for Person Re-Identification Under Diverse Weather Conditions
by Lei Cai, Yuying Liang, Bin Wang, Hexi Li, Jinquan Yang and Tao Zhu
Sensors 2026, 26(7), 2263; https://doi.org/10.3390/s26072263 - 6 Apr 2026
Viewed by 561
Abstract
Person re-identification (ReID) under diverse weather conditions remains a critical yet insufficiently explored problem. Most existing ReID approaches are developed and benchmarked on clear-weather datasets, resulting in significant performance degradation when deployed in rainy, snowy, or hazy environments. Conventional image restoration methods, typically optimized for low-level image quality metrics, are often misaligned with the objectives of high-level identity discrimination and thus fail to improve the person ReID performance. To address these limitations, we propose DW-ReID, a unified framework that integrates weather-degraded image restoration with person re-identification tasks. The proposed DW-ReID is built upon a large-scale Contrastive Language-Image Pre-training (CLIP) model and trained with a two-stage paradigm. In the first stage, a set of learnable text prompts is optimized to construct identity-specific ambiguous descriptions for each person’s identity. In the second stage, the optimized text descriptions, together with a frozen text encoder, provide language supervision to jointly train a weather encoder, an image restorer, and a ReID encoder in an end-to-end manner. The experimental results on our two contributed synthetic datasets consistently demonstrate the effectiveness and superior performance of the proposed DW-ReID method. Full article
