MDPI - Publisher of Open Access Journals

23 pages, 9226 KB

Open Access Article

A Method for Comment Text Feature Mining via Integrated Keyword Extraction, Clustering, and Sentiment Analysis

by Jinbao Song, Jiahui Cai, Yijun Wang, Kai Wang, Shiwen Cui and Nuo Xu

Appl. Syst. Innov. 2026, 9(6), 124; https://doi.org/10.3390/asi9060124 - 11 Jun 2026

Viewed by 114


In recent years, short video platforms have rapidly developed into important media for cultural dissemination. The interactions of netizens in short video comment sections not only reflect their focus on cultural content but also contain rich emotional attitudes. However, given the vast and fragmented nature of comment data, accurately extracting keywords, identifying cultural themes, and analyzing sentiment tendencies pose significant challenges in understanding netizens’ cultural perceptions. To address these challenges, this study proposes a text analysis framework that integrates keyword extraction, clustering analysis, and sentiment analysis to explore the core topics and emotional characteristics of cultural dissemination in short video comment sections. First, to address the challenge of balancing statistical information and semantic understanding in short-text keyword extraction, this paper proposes the TF-IDF-KeyBERT Integrated Algorithm (TKIA), which integrates Term Frequency–Inverse Document Frequency (TF-IDF) with KeyBERT, a keyword extractor built on Bidirectional Encoder Representations from Transformers (BERT). Experiments on the CSL dataset show an improvement in F1@5, indicating TKIA’s potential to enhance keyword extraction for short texts. Second, to address the difficulty of simultaneously considering semantic representation capability and clustering flexibility in short-text clustering analysis, this paper designs the Self-Supervised Contrastive Enhanced Clustering (SCEC) algorithm by integrating self-supervised contrastive learning with a soft clustering strategy. Compared to baseline methods, SCEC improves clustering accuracy (ACC) by 17.5% on AGNews and 6.8% on THUCNews, suggesting a more effective way to reveal the underlying structure of cultural topics. Finally, to address the challenge of effectively leveraging both text structural information and global semantic features in short-text sentiment analysis, this paper develops the BERT-GCN Cross-Attention (BGC) Model, integrating BERT embeddings and Graph Convolutional Network (GCN)-based structural features via a cross-attention mechanism. On the My_weibo_senti_100k dataset, the BGC model achieves a 2.45% increase in Macro-F1 and a 2.41% improvement in accuracy over strong baselines, indicating its capacity for high-precision modeling of user sentiment. This study offers effective data support and technical pathways for applications such as cultural content understanding, personalized recommendation, and user emotion guidance.
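The F1@5 metric cited for keyword extraction can be illustrated with a minimal sketch. The abstract does not define the exact variant used, so the function below assumes the common per-document form: precision and recall are computed over the top five predicted keywords against the gold keyword set. The name `f1_at_k` and the sample keywords are hypothetical.

```python
def f1_at_k(predicted, gold, k=5):
    """F1@k for one document: top-k predicted keywords vs. the gold keyword set.

    Assumed definition (not taken from the paper): precision = hits / |top-k|,
    recall = hits / |gold|, F1 = harmonic mean of the two.
    """
    top_k = list(dict.fromkeys(predicted[:k]))  # drop duplicates, keep rank order
    gold_set = set(gold)
    hits = sum(1 for keyword in top_k if keyword in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of the five predictions match the three gold keywords.
predicted = ["opera", "costume", "heritage", "music", "stage"]
gold = ["opera", "heritage", "tradition"]
score = f1_at_k(predicted, gold, k=5)
```

A corpus-level score would typically average this value over all documents.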

(This article belongs to the Special Issue Smart and Human-Centered Rehabilitation Technologies and Systems)


24 pages, 3427 KB

Open Access Article

A Multi-Class Classification Model for Text Related to Online Public Opinion Risks in Higher Education Institutions Based on Confidence-Aware Dynamic Fusion

by Xin Gu, Chengjun Wang, Kai Wang and Xiang Zhao

Information 2026, 17(6), 579; https://doi.org/10.3390/info17060579 - 10 Jun 2026

Viewed by 75

Abstract


With the widespread use of social media and online platforms in the dissemination of public opinion within universities, the multi-class classification of risk-related texts has become a critical component of online public opinion analysis in higher education institutions. Existing multi-class risk classification methods often focus on static semantic representations, making it difficult to effectively capture the emotional evolution within texts and the differences between samples, which in turn affects the accuracy of risk classification. To address this, this paper proposes a multi-class risk classification model for university online public opinion that integrates contextual semantic modeling, emotional evolution detection, and adaptive confidence-based feature fusion. The model employs pre-trained BERT for context encoding and, while preserving high-level semantic information, enhances the model’s adaptability to domain-specific features through a selective unfreezing strategy. First, a Bidirectional Gated Recurrent Unit (BiGRU) is introduced to model the emotional evolution trajectory within text sequences, and an emotional transition intensity metric is constructed by calculating the difference between adjacent hidden states, thereby explicitly capturing the magnitude of emotional changes. Additionally, a convolutional feature branch is designed to capture local emotional patterns, enhancing the model’s ability to perceive local risk cues and fine-grained emotional fluctuations. Finally, the Emotion-Adaptive Feature Mixer (EAFM) is introduced. This module adaptively weights global emotional evolution features and local emotional pattern features based on sample confidence to adjust the contributions of different feature branches in risk classification. 
Experimental results on the CUOPO dataset, which represents the university online public opinion scenario, show that the proposed model converges well and that its decisions can be interpreted through attention visualization and confidence coefficient analysis.
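The emotional transition intensity described above, built from the difference between adjacent hidden states, might look like the following sketch. The abstract does not state which norm is applied to the difference; the L2 norm here is an assumption, and `transition_intensity` is a hypothetical name. In the model the inputs would be BiGRU hidden states, which are replaced by plain lists of floats for illustration.

```python
import math


def transition_intensity(hidden_states):
    """Return one intensity value per pair of adjacent hidden states.

    Assumption: intensity is the L2 norm of h_t - h_(t-1). The paper may use a
    different norm or an additional normalization.
    """
    return [
        math.sqrt(sum((curr_x - prev_x) ** 2 for prev_x, curr_x in zip(prev, curr)))
        for prev, curr in zip(hidden_states, hidden_states[1:])
    ]


# Toy sequence of three 2-dimensional hidden states.
states = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
intensities = transition_intensity(states)  # large jump first, then no change
```

A large value marks a position where the representation shifts sharply, which is the signal the abstract describes as the magnitude of emotional change.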


27 pages, 7120 KB

Open Access Article

Systematic Fine-Tuning of Transformer Models for Domain-Specific Misinformation Detection in Spanish Social Media Text

by Gabriel Hurtado Avilés, José A. Reyes-Ortiz, Román A. Mora-Gutiérrez, Josué Padilla Cuevas and Óscar Herrera Alcántara

Informatics 2026, 13(6), 83; https://doi.org/10.3390/informatics13060083 - 9 Jun 2026

Viewed by 118

Abstract


While social media platforms are primary vectors for misinformation, automated detection systems remain largely confined to English. This paper presents a transferable, three-stage framework for fine-tuning transformer models to detect domain-specific deceptive content in Spanish. The pipeline comprises: (1) corpus unification, merging fragmented datasets into a 61,674-article resource mapped into three classes (Real, Fake, Satire) to prevent stylistic confounding; (2) systematic model optimization, extensively benchmarking classical metaheuristics against eight transformer architectures (including mBERT, XLM-RoBERTa, and BETO) using strong regularization to mitigate overfitting; and (3) production deployment, encapsulating the optimized model as a containerized web application for real-time inference. Through rigorous experimentation, the Spanish-specific BETO encoder emerged as the strongest model for this task, achieving 89.18% overall accuracy. The model attains a near-perfect in-source F1-score on the satire class; however, a strict source-held-out test reveals that this performance is highly source-dependent—recall on satire from an unseen outlet drops to 0.08—indicating that single-source class construction leads the model to recognize the source rather than a generalizable category. We report this finding as a central methodological result: corpus design, and in particular the source diversity of each class, is the primary determinant of whether the framework generalizes. Adversarial robustness tests using named-entity masking and typo injection provide complementary evidence on the model’s reliance on semantic versus surface cues. The methodology is designed to be adaptable across domains: by substituting the training corpus, the same framework may in principle be retargeted to other digital threats, such as investment scams and phishing, provided that suitable labeled corpora are constructed and validated for each new domain. 
The complete framework, dataset, and application are released as open-source resources to support reproducible research and practical countermeasures against online misinformation.
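The source-held-out test mentioned in the abstract can be sketched as a split in which every article from a chosen outlet goes to the test set, so the model never sees that outlet during training. This is only an illustration of the idea: the record fields `source` and `label`, the outlet names, and both function names are hypothetical and not taken from the released framework.

```python
def source_held_out_split(records, held_out_sources):
    """Split records so that held-out sources appear only in the test set."""
    held_out = set(held_out_sources)
    train = [r for r in records if r["source"] not in held_out]
    test = [r for r in records if r["source"] in held_out]
    return train, test


def recall_for_class(y_true, y_pred, label):
    """Fraction of items whose true class is `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions_for_label:
        return 0.0
    return sum(1 for p in predictions_for_label if p == label) / len(predictions_for_label)


# Hypothetical records: satire from "outlet_c" is held out entirely.
records = [
    {"source": "outlet_a", "label": "Real"},
    {"source": "outlet_b", "label": "Satire"},
    {"source": "outlet_c", "label": "Satire"},
]
train, test = source_held_out_split(records, ["outlet_c"])
```

Computing per-class recall on the held-out portion is what exposes the kind of source dependence the abstract reports for the satire class.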

(This article belongs to the Special Issue Machine Learning in Social Media Analysis)


32 pages, 3177 KB

Open Access Article

InspectCL: A Contrastive Learning Assistant for Similar Case Retrieval in Organizational Audit and Compliance

by Jianfeng Liu, Yuetian Huang, Changhua Hu, Kangheng Feng, Suining Zhu, Qingguo Shi and Yi Su

Electronics 2026, 15(11), 2495; https://doi.org/10.3390/electronics15112495 - 5 Jun 2026

Viewed by 168

Abstract


In large-scale state-owned enterprise audit and compliance tasks, ensuring that similar violations receive consistent disciplinary decisions is essential for procedural fairness and institutional credibility. However, existing retrieval methods face three major challenges: lexical matching methods fail to recognize semantically equivalent violation descriptions, general-purpose semantic encoders lack knowledge of inspection-specific terminology and regulatory distinctions, and retrieved precedents are often not directly transformed into actionable disciplinary references. To address these problems, this paper proposes InspectCL, a domain-enhanced contrastive learning and Retrieval-Augmented Generation framework for similar case retrieval, validated on audit data from a provincial power grid company. First, to provide task-specific supervision that is unavailable in existing benchmarks, we construct InspectCase, a de-identified dataset of 4200 audit and compliance cases across 12 violation categories, with expert-validated positive pairs and hard negative pairs. Second, to overcome the weak domain awareness of generic encoders, we design a domain-enhanced contrastive learning model. Specifically, terminology-masking augmentation improves robustness to specialized inspection expressions, regulatory semantic injection incorporates disciplinary rules to distinguish factually similar but legally different cases, and hierarchical contrastive optimization strengthens both case-level similarity learning and category-level boundary separation. Third, to convert retrieved precedents into practical decision support, the Top-K similar cases are used as evidence for a large language model to generate structured disciplinary recommendation summaries, including violation classification, penalty references, applicable regulations, and rectification measures. 
Experimental results on InspectCase show that InspectCL substantially outperforms BM25, BERT-base, SimCSE, and Legal-BERT baselines, achieving 56.9% ± 0.7% Recall@5 and an 87.6% ± 0.4% Penalty Consistency Score (PCS). These results demonstrate that the proposed problem-driven modules jointly improve semantic retrieval accuracy and disciplinary decision consistency, offering a practical reference for similar power-grid audit scenarios, with broader applicability to be validated in future cross-domain studies.
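Terminology-masking augmentation, as named in the abstract, could be sketched as randomly replacing domain terms with a mask token so that the encoder cannot rely on a single specialized expression. The abstract gives no implementation details, so everything below is an assumption: the function name `mask_terminology`, the `[MASK]` token, the masking probability, and the sample terms are all hypothetical.

```python
import random


def mask_terminology(tokens, domain_terms, mask_prob=0.5, mask_token="[MASK]", seed=None):
    """Return a copy of `tokens` where domain terms are masked with probability `mask_prob`.

    Tokens that are not in `domain_terms` are always left unchanged.
    """
    rng = random.Random(seed)
    return [
        mask_token if token in domain_terms and rng.random() < mask_prob else token
        for token in tokens
    ]


# Hypothetical tokenized case description and term list.
tokens = ["employee", "accepted", "gifts", "during", "procurement", "review"]
domain_terms = {"gifts", "procurement"}
augmented = mask_terminology(tokens, domain_terms, mask_prob=0.5, seed=0)
```

In a contrastive setup, the original and augmented versions of the same case would plausibly serve as a positive pair.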

(This article belongs to the Special Issue AI-Powered Natural Language Processing Applications)


26 pages, 33536 KB

Open Access Article

A Global Collaborative Discriminative Denoising Network for Text-to-Image Person Re-Identification

by Shaozhen Han and Shuai Guo

Sensors 2026, 26(11), 3604; https://doi.org/10.3390/s26113604 - 5 Jun 2026

Viewed by 375

Abstract


Text-to-Image Person Re-Identification (TI-ReID) aims to retrieve target pedestrians from large-scale image galleries using natural language descriptions. Despite recent progress achieved by dual-tower architectures based on vision-language pre-training, these methods remain susceptible to semantic misalignment and noise induced by occlusions, background clutter, and fine-grained attribute distractions. To mitigate these issues, we propose a Global Collaborative Discriminative Denoising Network (GCDD), a dual-tower fine-tuning framework built upon a CLIP visual encoder and a BERT text encoder. Specifically, GCDD introduces three complementary branches for robust feature enhancement. First, Discriminative Token Selection (DTS) performs adaptive hard filtering to suppress low-information tokens. Second, Global-Guided Feature Adaptation (GFA) leverages modality-specific global semantics to recalibrate local features. Third, Query-Driven Aggregation (QDA) constructs more discriminative global representations via attentive pooling, where the backbone global feature serves as the query. The outputs of the three branches are fused through a parameter-free averaging strategy to produce the final representation. Extensive experiments on three standard TI-ReID benchmarks demonstrate that GCDD achieves competitive performance, validating the effectiveness of the proposed feature enhancement framework.
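Two of the steps above lend themselves to a small sketch: attentive pooling with the global feature as the query, and parameter-free averaging of the branch outputs. The version below assumes scaled dot-product attention without learned projections, which the abstract does not confirm; `attentive_pool` and `average_fusion` are hypothetical names, and vectors are plain Python lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(query, tokens):
    """Pool token vectors into one vector, weighted by similarity to the query.

    Assumption: weights are a softmax over scaled dot products between the
    global feature (query) and each local token feature.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, token)) / scale for token in tokens]
    weights = softmax(scores)
    return [sum(w * token[i] for w, token in zip(weights, tokens)) for i in range(len(query))]


def average_fusion(branch_outputs):
    """Parameter-free fusion: element-wise mean of the branch output vectors."""
    count = len(branch_outputs)
    return [sum(values) / count for values in zip(*branch_outputs)]


# Toy example: a 2-dimensional global feature and three local tokens.
pooled = attentive_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
fused = average_fusion([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Averaging adds no trainable parameters, which matches the "parameter-free" wording; the pooling weights here depend only on the features themselves.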

(This article belongs to the Section Sensing and Imaging)


23 pages, 1197 KB

Open Access Article

MG-ECF: Multi-Granularity Entity-Context Fusion for Drug–Drug Interaction Extraction

by Hiba Chanaa, Loqman Chakir and El Habib Nfaoui

Future Internet 2026, 18(6), 289; https://doi.org/10.3390/fi18060289 - 27 May 2026

Viewed by 239

Abstract


Adverse drug–drug interactions (DDIs) are a leading cause of preventable medication-related harm, and their automatic detection from the biomedical literature is critical for pharmacovigilance and clinical decision support. Most existing systems derive relation representations from a single source, under-using the rich contextual structure of DDI sentences. We propose MG-ECF (Multi-Granularity Entity-Context Fusion), a relation-classification architecture that extracts three complementary views from a shared biomedical encoder (entity-level, inter-entity contextual, and global sentence representations), which are adaptively combined through a temperature-scaled gating mechanism regularized by view-dropout. MG-ECF was evaluated on the DDI-2013 benchmark under the official shared-task protocol, with multi-seed experiments on BioBERT and BiomedBERT backbones and a focal-loss objective to address severe class imbalance. MG-ECF achieves a mean micro-F1 of 90.55% with BiomedBERT and 88.8% with BioBERT; the BiomedBERT result is an absolute improvement of 2.7 F1 points over the strongest previously reported PLM-based system (BioMCL-DDI, 87.8%). Systematic component analyses confirm the contribution of each representational view, demonstrating the effectiveness of multi-granularity fusion for DDI classification and its potential as a research-stage building block for Internet-based pharmacovigilance platforms and networked clinical decision support systems, pending real-world clinical validation.
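A temperature-scaled gate over several views, with view-dropout during training, can be sketched as follows. This is an illustration under stated assumptions rather than the MG-ECF implementation: the gate is taken to be a softmax over per-view logits divided by a temperature, and view-dropout is modeled as excluding whole views from that softmax. The names `gated_fusion` and `sample_keep_mask` are hypothetical.

```python
import math
import random


def sample_keep_mask(num_views, drop_prob, rng):
    """Randomly drop whole views, always keeping at least one (assumed behavior)."""
    mask = [rng.random() >= drop_prob for _ in range(num_views)]
    if not any(mask):
        mask[rng.randrange(num_views)] = True
    return mask


def gated_fusion(views, gate_logits, temperature=1.0, keep_mask=None):
    """Combine view vectors with softmax(gate_logits / temperature) weights.

    Dropped views receive weight 0. Returns (fused_vector, weights).
    """
    if keep_mask is None:
        keep_mask = [True] * len(views)
    if not any(keep_mask):
        raise ValueError("at least one view must be kept")
    scaled = [
        logit / temperature if keep else float("-inf")
        for logit, keep in zip(gate_logits, keep_mask)
    ]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * view[i] for w, view in zip(weights, views)) for i in range(len(views[0]))]
    return fused, weights


# Toy entity-level, inter-entity, and sentence-level views (2-dimensional).
views = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
fused, weights = gated_fusion(views, gate_logits=[2.0, 1.0, 0.0], temperature=0.5)
mask = sample_keep_mask(len(views), drop_prob=0.3, rng=random.Random(0))
```

A temperature below 1 sharpens the gate toward the highest-scoring view, while a temperature above 1 spreads the weight more evenly across views.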


20 pages, 769 KB

Open Access Article

Note-Level Phenotyping of Multiple-Sclerosis Notes by a Large Language Model Achieves near Human-Level Agreement

by Daniel B. Hier, Pavankumar Y. Srinivasula and Michael D. Carrithers

J. Clin. Med. 2026, 15(11), 4092; https://doi.org/10.3390/jcm15114092 - 25 May 2026

Viewed by 208

Abstract


Background/Objectives: Clinical phenotyping from narrative electronic health records (EHRs) often relies on multi-stage pipelines involving span-level extraction, ontology mapping, and aggregation. Large language models (LLMs) may enable direct document-level abstraction of clinically meaningful phenotype features from complete notes. We evaluated whether GPT-5.2 could approximate human annotation for note-level multiple sclerosis (MS) phenotyping and compared its performance with human annotators, a locally run open-source LLM, HPO-based extraction tools, and a supervised clinical transformer encoder. Methods: We analyzed 100 de-identified MS neurology progress notes from a single academic medical center. Each note was annotated for the presence or absence of 17 predefined neurological phenotype categories. Two human annotators independently labeled all notes using a multi-label note-level framework in Prodigy, and disagreements were adjudicated to create a reference annotation set. GPT-5.2 was evaluated in a zero-shot setting using structured JSON output. Comparator methods included Llama-3.1 8B, Doc2Hpo, ClinPhen, PhenoSnap, and BioClinical ModernBERT. Performance was assessed using agreement, precision, recall, F1, Matthews correlation coefficient, and false-positive and false-negative assignments per note. Results: Human–human agreement was generally high, although lower for rare or ambiguously documented features. GPT-5.2 achieved the strongest automated performance, with macro-precision 0.734, macro-recall 0.921, macro-F1 0.801, and macro-averaged MCC 0.777, approaching human annotator performance. GPT-5.2 showed the lowest false-negative count per note but more false-positive assignments than either human annotator, reflecting a sensitive but more inclusive annotation profile. 
Llama-3.1 8B performed competitively among automated methods, whereas HPO-based extraction tools and BioClinical ModernBERT showed lower performance on this low-resource note-level task. Secondary review of GPT-5.2 discordant assignments found no clear hallucinations and suggested that some apparent false positives reflected phenotype evidence missed in the human-derived reference set. Conclusions: GPT-5.2 achieved near-human performance for document-level recognition of MS phenotype categories from narrative neurology notes. Direct note-level abstraction may provide a scalable approach for research and population-health phenotyping of large EHR note corpora.
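The macro-averaged Matthews correlation coefficient (MCC) reported above can be computed per phenotype category and then averaged, as in this minimal sketch. It assumes binary present/absent labels per note for each category, which matches the multi-label note-level setup described; the category names and label values in the example are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def macro_mcc(reference, predicted):
    """Unweighted mean of per-category MCC; inputs map category -> per-note labels."""
    return sum(mcc(reference[c], predicted[c]) for c in reference) / len(reference)


# Hypothetical labels for two categories across four notes.
reference = {"weakness": [1, 0, 1, 0], "ataxia": [0, 0, 1, 1]}
predicted = {"weakness": [1, 0, 1, 1], "ataxia": [0, 0, 1, 1]}
score = macro_mcc(reference, predicted)
```

Because the macro average weights every category equally, rare phenotype categories influence the score as much as common ones.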

(This article belongs to the Special Issue Advancing Personalized and Precision Medicine with Large Language Models)


32 pages, 4763 KB

Open Access Article

Explainable Text-Based Depression and Suicide Risk Prediction from Social Media Using Deep Learning and Graph Neural Networks

by Atiq Ur Rehman, Abid Iqbal, Ali Sayyed, Zaheer Aslam, Muhammad Ismail Mohmand and Ghassan Husnain

Healthcare 2026, 14(11), 1440; https://doi.org/10.3390/healthcare14111440 - 22 May 2026

Viewed by 253

Abstract


Objectives: The rising frequency of mental health concerns (depression and suicide) expressed on social media calls for reliable, explainable, and efficient computational methods for mental health surveillance. In this paper, we propose an interpretable framework for text-based detection of post-level and community-level mental health risk on social media. Methods: The framework combines (i) Secretary Bird Optimization (SBO) for selecting informative linguistic and psychological features, (ii) a BERT-CNN model (Bidirectional Encoder Representations from Transformers combined with a Convolutional Neural Network) for post-level reasoning, and (iii) a Graph Neural Network (GraphSAGE) for community-level reasoning. The graph is constructed from semantic similarity between posts and from author relations, rather than from social interactions (e.g., mentions, replies) between authors. We use SHAP and LIME for model interpretability, together with uncertainty and calibration analysis, to evaluate the trustworthiness of predictions. Results: The model achieves 93.1% accuracy, an F1-score of 0.91, and a ROC-AUC of 0.944 on the eRisk and CLPsych datasets using a strict user-disjoint validation strategy. SBO reduces the number of features by about 38%, leading to better generalization. The graph-based model improves the learning of post and user representations by capturing relational dependencies. Conclusions: Our approach offers an explainable and robust means of detecting mental health risk from text. Graph-based representations of semantic and authorship interactions enable community-level analyses, while interpretability and uncertainty estimation facilitate possible human-in-the-loop decision-making. No human-in-the-loop experiment is conducted in this study.
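The graph described above links posts by semantic similarity and by shared authorship instead of by replies or mentions. One way this could be sketched is shown below; the cosine-similarity threshold, the record fields `author` and `embedding`, and the function names are assumptions for illustration, and a real pipeline would pass the resulting edges to a GraphSAGE implementation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_post_graph(posts, sim_threshold=0.8):
    """Return undirected edges (i, j), i < j, between posts that share an author
    or whose embeddings have cosine similarity at or above `sim_threshold`."""
    edges = set()
    for (i, first), (j, second) in combinations(enumerate(posts), 2):
        same_author = first["author"] == second["author"]
        similar = cosine(first["embedding"], second["embedding"]) >= sim_threshold
        if same_author or similar:
            edges.add((i, j))
    return edges


# Hypothetical posts with toy 2-dimensional embeddings.
posts = [
    {"author": "user_a", "embedding": [1.0, 0.0]},
    {"author": "user_b", "embedding": [1.0, 0.1]},
    {"author": "user_a", "embedding": [0.0, 1.0]},
]
edges = build_post_graph(posts, sim_threshold=0.8)
```

The pairwise loop is quadratic in the number of posts, so a larger corpus would likely need approximate nearest-neighbor search for the similarity edges.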

(This article belongs to the Special Issue Innovative Suicide Prevention Methods: The Role of New Technologies and Medical Services in Saving Lives)


23 pages, 705 KB

Open Access Article

LLM-SGCF: A Robust Malware Detection Framework with Spatially Guided Convolution

by Lina Zhao, Hua Huang, Ning Li, Yunxiao Wang and Ming Li

Computers 2026, 15(6), 329; https://doi.org/10.3390/computers15060329 - 22 May 2026

Viewed by 273

Abstract


With the rapid evolution of cyberattack techniques, identifying dynamic behavioral intents from Application Programming Interface (API) call sequences has become a fundamental modality for ensuring reliable malware detection and information security. However, existing detection methods face the dual challenges of semantic sparsity and inadequate spatial dependency modeling when processing these sequences, which undermines their stability against complex structural variations and in-the-wild evasive patterns. To address these weaknesses, we propose LLM-SGCF, a malware detection framework that jointly models deep behavioral semantics and spatial structures. Specifically, our framework uses generative Large Language Models to transform sparse API calls into rich, contextualized descriptions, which are subsequently encoded by BERT. Concurrently, it employs a novel Spatially Guided Convolution (SGC) module to localize critical malicious segments and extract cross-position dependencies in a two-dimensional semantic space. Extensive experiments on the public Aliyun and Catak datasets demonstrate that LLM-SGCF is resilient to real-world structural complexity and significantly outperforms state-of-the-art baselines, achieving a peak binary-classification accuracy of 95.82%. Further ablation analyses confirm that fusing LLM-driven semantic enhancement with spatial structural modeling improves the resilience of the framework against complex attack chains, providing a reliable paradigm for next-generation malware recognition systems.

(This article belongs to the Special Issue AI-Powered IoT (AIoT) Systems: Advancements in Security, Sustainability, and Intelligence)


25 pages, 3702 KB

Open Access Article

MELT: Optimization-Driven Music Emotion Learning with Temporal Token-Level Fusion

by Yihe Yin, Zhen Tian and Junming Chen

Mathematics 2026, 14(10), 1690; https://doi.org/10.3390/math14101690 - 15 May 2026

Viewed by 283

Abstract


Music emotion recognition (MER) can be formulated as a multimodal optimization problem that predicts an emotion label from coupled audio and lyric sequences. Existing methods typically perform unimodal learning or coarse global fusion, which overlooks fine-grained temporal-token correspondences between musical dynamics and lyric semantics. We propose MELT (Music Emotion Learning with Temporal token-level fusion), an optimization-driven framework with four modules: a BERT-based lyrics semantic encoder (LSE), a segment temporal encoder (STE) that models audio-segment dependencies via a Transformer, a token-level temporal fusion (TTF) module with gated cross-attention, and an emotion mood head (EMH) for four-class prediction. Training is conducted end-to-end by jointly minimizing a supervised classification term and an auxiliary cross-modal contrastive alignment term, yielding a unified objective that improves both class separability and representation consistency. On the MoodyLyrics benchmark, MELT achieves 87.6% weighted F1 for four-class emotion recognition (angry, happy, relaxed, sad), outperforming unimodal baselines and representative early/late fusion strategies. Ablation results further verify that temporal encoding, gated token-level fusion, and joint optimization each contribute to the final performance.
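The joint objective described above, a classification term plus an auxiliary cross-modal contrastive term, might be sketched like this. The abstract does not name the contrastive loss, so an InfoNCE-style loss over in-batch audio and lyric pairs is assumed here; the weighting factor `align_weight`, the temperature, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (numerically stable)."""
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_partition - logits[target]


def contrastive_alignment(audio_vecs, lyric_vecs, temperature=0.1):
    """InfoNCE-style term (assumed): the i-th audio vector should match the i-th
    lyric vector, with the other lyrics in the batch acting as negatives."""
    losses = []
    for i, audio in enumerate(audio_vecs):
        sims = [sum(a * b for a, b in zip(audio, lyric)) / temperature for lyric in lyric_vecs]
        losses.append(cross_entropy(sims, i))
    return sum(losses) / len(losses)


def joint_loss(class_logits, targets, audio_vecs, lyric_vecs, align_weight=0.5):
    """Mean classification loss plus a weighted contrastive alignment term."""
    classification = sum(
        cross_entropy(logits, target) for logits, target in zip(class_logits, targets)
    ) / len(targets)
    return classification + align_weight * contrastive_alignment(audio_vecs, lyric_vecs)


# Toy batch of two songs with four emotion classes (angry, happy, relaxed, sad).
loss = joint_loss(
    class_logits=[[2.0, 0.1, 0.0, -1.0], [0.0, 0.2, 1.5, 0.1]],
    targets=[0, 2],
    audio_vecs=[[1.0, 0.0], [0.0, 1.0]],
    lyric_vecs=[[0.9, 0.1], [0.2, 0.8]],
)
```

Setting `align_weight` to zero recovers plain supervised classification, which is one way to frame the "joint optimization" ablation mentioned in the abstract.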

(This article belongs to the Special Issue Intelligent Mathematics and Applications)


27 pages, 4001 KB

Open Access Feature Paper Article

SGA-DCAT: Sentiment-Gated Dual-Stream Cross-Attention Temporal Network for Stock Price Prediction

by Jing Liu, Yuan Lu and Wenhao Kang

Mathematics 2026, 14(10), 1661; https://doi.org/10.3390/math14101661 - 13 May 2026

Viewed by 220

Abstract


We propose SGA-DCAT (Sentiment-Gated Dual-stream Cross-Attention Temporal Network), an architecture for stock price prediction that treats news sentiment not as a passive input feature but as a control signal that governs the model’s memory dynamics. Most sentiment-augmented methods simply concatenate sentiment scores with price inputs, which implicitly assumes that the relationship between sentiment and price is the same regardless of market conditions. SGA-DCAT departs from this practice in three ways: (1) a sentiment-gated adaptive LSTM cell modulates the forget and input gates directly through sentiment signals so that memory retention varies with market mood; (2) a dual-stream cross-attention mechanism encodes price and sentiment through separate recurrent networks and fuses them via learned attention, keeping modality-specific representations intact; and (3) a multi-scale adaptive aggregation module combines predictions from 5-day, 10-day, and 20-day windows using a gating network conditioned on the current market state. On the NASDAQ-100 index with FinBERT-derived news sentiment, SGA-DCAT achieves a MAPE of 1.40%, RMSE of 217.83, and MAE of 171.58, outperforming Persistence, ARIMA, GARCH, MLP, LSTM, GRU, Transformer, DA-RNN, TFT, and LSTM + FinBERT baselines (all p < 0.001, Diebold–Mariano test). In ablation experiments, each component contributes measurably, and SGA-DCAT reaches 66.93% directional accuracy on the held-out test set. Rolling-window evaluation across four non-overlapping temporal folds and regime-specific analysis confirm that these results are robust across bull, bear, and volatile market conditions.
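The error measures quoted above (MAPE, RMSE, MAE, and directional accuracy) can be sketched for a series of actual and predicted index levels. The first three follow their standard definitions. Directional accuracy has several variants, and the one below, which compares the sign of the predicted move from the previous actual value with the sign of the actual move, is an assumption rather than the paper's stated definition.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(actual, predicted):
    """Share of steps where the predicted move from the previous actual value
    has the same sign as the actual move (assumed definition)."""
    hits = 0
    steps = len(actual) - 1
    for t in range(1, len(actual)):
        actual_move = actual[t] - actual[t - 1]
        predicted_move = predicted[t] - actual[t - 1]
        if actual_move * predicted_move > 0:
            hits += 1
    return hits / steps


# Toy series: the second forecast gets the direction right, the third does not.
actual = [100.0, 102.0, 101.0]
predicted = [100.0, 101.0, 103.0]
summary = (mae(actual, predicted), rmse(actual, predicted), directional_accuracy(actual, predicted))
```

Note that RMSE and MAE are expressed in index points, so values such as 217.83 are only meaningful relative to the level of the index being forecast.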


31 pages, 13501 KB

Open Access Article

Adaptive 3D Human Pose Estimation Based on Spatial–Temporal Complexity Awareness

by Wensi Zhang, Ziyan Yang, Chengfeng Hu, Jing Sun and Jie Li

Electronics 2026, 15(10), 2076; https://doi.org/10.3390/electronics15102076 - 13 May 2026

Viewed by 384

Abstract


Existing 3D human pose estimation methods use fixed computation strategies when processing diverse action sequences, leading to computational redundancy for simple actions, insufficient capture of high-frequency information for complex actions, and low efficiency on long sequences. To address these issues, this paper proposes a Spatial–Temporal Complexity-Aware Adaptive Computation Framework (CAAPoseFormer). First, a spatial–temporal coupled complexity quantification module is built to integrate spatial dispersion and temporal motion variance for graded action complexity quantification. On this basis, a time–frequency dual-domain adaptive pruning strategy is proposed to dynamically allocate temporal window length and frequency-domain DCT coefficients on demand. Furthermore, a mask-guided sparse interaction encoding mechanism is designed to enable efficient parallel computation of variable-length features by shielding invalid padding regions. Experiments on the Human3.6M dataset show that, compared with the baseline PoseFormerV2, the proposed method cuts parameters by 85.3% and computational cost by 64.8% while retaining comparable accuracy (MPJPE 44.2 mm), boosting unit computational efficiency 2.8×. Moreover, compared with state-of-the-art (SOTA) methods such as MHFormer and MotionBERT, our method reduces computational cost (MACs) by 97.4% and by nearly three orders of magnitude, respectively. The framework eases the inference bottleneck of high-precision models on low-power hardware and suits latency-sensitive real-time applications.
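The percentage reductions quoted above can be converted into "times smaller" factors with a one-line formula, which helps when comparing them with the 2.8× figure. The sketch below only does that arithmetic. Reading the 2.8× efficiency gain as the reciprocal of the remaining 35.2% of computation is an assumption, since the abstract does not define "unit computational efficiency".

```python
def reduction_to_factor(reduction_pct):
    """Convert 'reduced by X%' into the factor by which the quantity shrinks."""
    remaining = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining


# Figures taken from the abstract; the labels are descriptive only.
reported = [
    ("parameters vs. PoseFormerV2", 85.3),
    ("computational cost vs. PoseFormerV2", 64.8),
    ("MACs vs. MHFormer", 97.4),
]
factors = {name: round(reduction_to_factor(pct), 1) for name, pct in reported}
for name, factor in factors.items():
    print(f"{name}: about {factor}x smaller")
```

Under this reading, a 64.8% cost reduction at comparable accuracy corresponds to roughly the 2.8× figure, and a 97.4% reduction corresponds to a factor of about 38.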

(This article belongs to the Special Issue Advances in Real-Time Object Detection and Tracking)

32 pages, 8414 KB

Open AccessArticle

TVLightFormer: A Lightweight Cross-Modal Transformer for Language-Guided Target Localization in SAR Imagery

by Yuqiao Zhong, Haoqi Quan, Chenyu Nie, Yingmei Wei and Yanming Guo

Remote Sens. 2026, 18(9), 1430; https://doi.org/10.3390/rs18091430 - 4 May 2026

Viewed by 281

Abstract

We study language-guided target localization in synthetic aperture radar (SAR) imagery for deployment on resource-constrained platforms. Existing vision-language models either rely on heavy backbones unsuitable for edge devices or are designed for natural images, overlooking SAR-specific characteristics such as speckle noise, weak scattering responses, and geometric distortions. The proposed model, TVLightFormer, combines a lightweight dual-modal encoder (MobileNetV3 and TinyBERT) with a grouped-query attention (GQA) mechanism for efficient cross-modal interaction and an activation-free lightweight feature pyramid network (LFPN) to handle scale variation while preserving weak scattering signals. The individual modules are not claimed as newly invented components; the main contribution lies in their SAR-aware integration for edge-oriented cross-modal localization. We evaluate the model on five remote sensing datasets (SOMA-1M, ATRNet-STAR, GAIA, MLRSNet, and SODAS) under a unified localization setting, and we explicitly discuss the limitations introduced by weak or scene-level annotations. The results show that TVLightFormer achieves a favorable trade-off between accuracy and efficiency, reaching an average mIoU of 69.8% with 27.4 M parameters and 9.7 GFLOPs. Ablation studies quantify the contribution of each component. The model is suited to edge-oriented scenarios where computational resources are limited. We also provide a critical analysis of failure cases, SAR-specific disturbance factors, loss-function choices, and dataset-protocol sensitivity.
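
For readers unfamiliar with grouped-query attention, a minimal NumPy sketch of the general technique follows. It shows only the generic mechanism (several query heads sharing one key/value head) and is not TVLightFormer's code; the tensor layout, the head counts, and the choice of which modality supplies queries versus keys and values are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (Hq, Lq, d); k, v: (Hkv, Lk, d), where Hq is a multiple of Hkv."""
    hq, hkv = q.shape[0], k.shape[0]
    assert hq % hkv == 0, "query heads must split evenly into key/value groups"
    group = hq // hkv
    # Each key/value head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)           # (Hq, Lk, d)
    v = np.repeat(v, group, axis=0)           # (Hq, Lk, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v                # (Hq, Lq, d)
```

With, say, 8 query heads and 2 key/value heads, the key and value projections are a quarter of the size used by standard multi-head attention, which is the usual reason for choosing GQA in lightweight models.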

(This article belongs to the Special Issue Radar and Photo-Electronic Multi-Modal Intelligent Fusion)

31 pages, 4372 KB

Open AccessArticle

Text-Anchored Residual Cross-Modal Fusion for Multimodal Sentiment Analysis: A Unified and Protocol-Aware Evaluation on MVSA-Single

by Kosala Natarajan and Nirmalrani Vairaperumal

Appl. Sci. 2026, 16(9), 4514; https://doi.org/10.3390/app16094514 - 4 May 2026

Viewed by 549

Abstract

Multimodal sentiment analysis aims to infer sentiment polarity by jointly modeling textual and visual information. Despite recent advances in pretrained language and vision encoders, sentiment prediction from social media posts remains challenging because textual and visual modalities are often weakly aligned, semantically noisy, and unevenly informative. Recent studies have emphasized the importance of fine-grained cross-modal fusion, stronger pretrained visual representations, and strategies for reducing modality bias in MVSA-style benchmarks. In this work, we present a systematic implementation-driven study of multimodal sentiment classification on MVSA-Single. We first construct a clean three-class sentiment-consistent subset and then implement a wide set of baselines, including text-only DistilBERT, image-only ResNet18, simple multimodal fusion, gated fusion, residual fusion, multi-task contrastive fusion, DINOv2-based fusion, and attention bottleneck fusion. Building on these experiments, we propose a semantic cross-modal fusion architecture that combines a RoBERTa text encoder with a CLIP vision encoder through cross-attention, allowing textual representations to selectively attend to sentiment-relevant visual signals. On the clean 2592-sample subset, the proposed model achieved the best overall performance, reaching 82.63% validation accuracy, 79.62% test accuracy and 79.42 weighted F1, outperforming all other implemented baselines under the same experimental pipeline and dataset setting. To improve comparability with prior MVSA-Single studies, we additionally reconstructed a broader processed setting from the 4511-sample HDF5 version and aligned 4318 text–image pairs with original image files. On this harder protocol-matched setting, the same model achieved 72.69% test accuracy and 70.66 weighted F1, revealing a substantial performance gap caused by dataset construction and residual multimodal noise. These findings show that strong cross-modal semantic alignment contributes more to robust multimodal sentiment prediction than simply increasing architectural complexity and that CLIP-based visual semantics are more beneficial than DINOv2 in our text–image sentiment setting.
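
The weighted F1 figures quoted in this abstract can be read with the help of a small Python sketch of the standard definition: per-class F1 averaged with weights equal to each class's share of the true labels. The function name and the toy labels are assumptions for illustration, and the function returns a fraction in [0, 1], whereas the abstract reports the score on a 0 to 100 scale.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # F1 = 2TP / (2TP + FP + FN)
        score += (count / total) * f1
    return score

# Toy three-class example (hypothetical labels, not MVSA-Single data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, weighted F1 on an imbalanced three-class dataset is dominated by the larger classes, which is worth keeping in mind when comparing the clean-subset and protocol-matched results.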

26 pages, 1714 KB

Open AccessArticle

SV-GEN: Synergizing LLM-Empowered Variable Semantics and Graph Transformers for Vulnerability Detection

by Zhaohui Liu, Haocheng Yang and Wenjie Xie

Future Internet 2026, 18(5), 236; https://doi.org/10.3390/fi18050236 - 27 Apr 2026

Viewed by 623

Abstract

Deep-learning-based vulnerability detection has made substantial progress, but two limitations remain prominent. Sequence-based methods linearize source code and thus weaken the explicit modeling of control-flow and data-flow dependencies. Graph-based methods preserve program structure, yet conventional graph neural networks still have difficulty capturing long-range interactions in large code property graphs (CPGs). In addition, standard CPGs usually lack explicit variable semantics and security-critical node roles, which limits their ability to represent vulnerability-relevant program behavior. To address these issues, we propose SV-GEN, a vulnerability detection framework that combines large-language-model-driven semantic enhancement with hybrid sequence-graph learning. The novelty of SV-GEN lies in introducing a semantically enriched code property graph, termed Sem-CPG, which augments conventional CPGs with variable semantic roles and security-oriented node labels, and in coupling this representation with an adaptive fusion mechanism over structural and sequential views. Specifically, we use a large language model as an external semantic annotator to assign variable roles and identify source, sink, and sanitizer nodes, and then encode the resulting Sem-CPG with a Graph Transformer while modeling the code sequence with GraphCodeBERT. A learnable gating module is further used to adaptively fuse the graph-level and sequence-level representations for final prediction. Experiments on Devign, ReVeal, and DiverseVul show that SV-GEN achieves competitive or superior overall performance across benchmarks, with particularly strong improvements on the large and highly imbalanced DiverseVul dataset.
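
One common form of learnable gating, which may or may not match the module used in SV-GEN, computes a sigmoid gate from the concatenated graph-level and sequence-level vectors and then mixes the two element-wise. The NumPy sketch below shows that generic form; the function name, the parameter shapes, and the exact gating formula are assumptions, since the abstract does not specify them.

```python
import numpy as np

def gated_fusion(h_graph, h_seq, W, b):
    """Fuse two d-dimensional vectors with a gate; W: (2d, d), b: (d,).

    W and b stand in for learned parameters (hypothetical here).
    """
    z = np.concatenate([h_graph, h_seq], axis=-1)     # (2d,)
    gate = 1.0 / (1.0 + np.exp(-(z @ W + b)))         # sigmoid, values in (0, 1)
    # gate near 1 favors the graph view, gate near 0 favors the sequence view.
    return gate * h_graph + (1.0 - gate) * h_seq

h_graph = np.array([1.0, 2.0])
h_seq = np.array([3.0, 6.0])
# Zero parameters give a gate of 0.5 everywhere, i.e. a plain average.
print(gated_fusion(h_graph, h_seq, np.zeros((4, 2)), np.zeros(2)))
```

Because the gate is computed per dimension and per sample, such a module can lean on the graph representation for one function and on the sequence representation for another, which is one plausible reading of "adaptively fuse".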

(This article belongs to the Special Issue Security of Computer System and Network)

Search Results (538)
