MDPI - Publisher of Open Access Journals

44 pages, 7491 KB

Open Access Article

SemNet Explorer: An Evidence-Grounded Knowledge Graph–LLM Framework for Multi-Scale Mechanistic Reporting Across Biomedical Domains

by Xin He, David Camacho, Lama Moukheiber, Meghna Iyer, Benjamin Zhao, Christophe Ye, Batuhan Nursal, Xinyu Guo, Albert J. B. Lee and Cassie S. Mitchell

Big Data Cogn. Comput. 2026, 10(6), 171; https://doi.org/10.3390/bdcc10060171 - 25 May 2026

Abstract

Background: Mechanistic reporting from large-scale biomedical knowledge graphs remains challenging, particularly when integrating structured graph evidence with large language model (LLM)–based explanation in a reproducible and auditable manner. Existing approaches either rely on manual synthesis of graph-derived results or generate unconstrained narratives that lack traceability to underlying evidence. Methods: We present SemNet Explorer, an evidence-grounded knowledge graph–LLM unified framework for automated mechanistic reporting across biomedical domains using SemNet 2.0, a PubMed-scale heterogeneous knowledge graph. Given a set of target concepts and a selected semantic layer, the framework organizes graph-derived evidence into structured regions and generates two complementary report types: global reports for process-level mechanisms and anchor-centric reports for localized mediator-based explanations. A central methodological contribution is an ablation-derived adaptive grounding policy: we systematically compare alternative evidence-integration strategies across report types, semantic layers, and region structures, and use the resulting preferences to guide prompt selection in the deployed system. Results: SemNet Explorer produces stable region decompositions and interpretable report scaffolds across molecular (AAPP), disease-level (DSYN), and pharmacologic (PHSU) representations. For global reports, explicit evidence grounding improves expression quality more consistently than content accuracy, with benefits dependent on evidence density and semantic abstraction. In contrast, anchor-centric reports show consistent improvements in both content and expression under stronger, mediator-constrained prompting. These findings are supported by both pairwise ablation comparisons and absolute score analyses. 
Conclusions: SemNet Explorer establishes a generalizable unified framework and interactive platform for transforming knowledge graph evidence into reproducible mechanistic narratives across biomedical domains, including multimorbidity analysis, comparative pathophysiology, drug repurposing, and adverse event discovery. The results demonstrate that effective knowledge graph–LLM integration requires adaptive, context-dependent evidence grounding rather than fixed prompting strategies. Full article

(This article belongs to the Topic Advances in Integrative AI, Machine Learning, and Big Data for Transformative Applications)

28 pages, 761 KB

Open Access Article

A Survey on Student Awareness of Spoofing Attacks in Saudi Arabia

by Niddal H. Imam

Big Data Cogn. Comput. 2026, 10(6), 170; https://doi.org/10.3390/bdcc10060170 - 24 May 2026

Viewed by 76

Abstract

The increasing prevalence of digital communication has made students a primary target for various cyber threats, including identity deception and impersonation techniques that can lead to data breaches and financial loss. In Saudi Arabia, where the youth population is digitally active and integrated into online learning environments, understanding their vulnerability to such threats is paramount. This paper investigates university students’ awareness, confidence, and behavioral responses to different types of spoofing attacks, including email, SMS, caller ID, and website spoofing, in Saudi Arabia. A survey of 1437 students at Saudi Electronic University was conducted, and the responses were analyzed quantitatively using statistical tests such as Chi-square tests, Friedman tests, Kruskal–Wallis tests, correlation analysis, and regression models. The results indicate that students exhibit a relatively high level of awareness. However, awareness and confidence vary across demographic groups, with significant differences associated with gender and age group. The results also reveal a significant gap between perceived confidence and detection ability in scenario-based assessments, highlighting that self-reported awareness does not necessarily translate into practical identification skills. The study emphasizes the importance of strengthening practical cybersecurity education, simulation-based training, and effective awareness delivery methods to improve students’ ability to recognize impersonation-based cyber threats in the Saudi educational sector. Full article

30 pages, 536 KB

Open Access Article

An Attention-Driven Feature Fusion Approach for Multimodal Aspect-Based Sentiment Analysis

by Ismail Ifakir, El Habib Nfaoui, Abderrahim Zannou and Asmaa Mourhir

Big Data Cogn. Comput. 2026, 10(6), 169; https://doi.org/10.3390/bdcc10060169 - 23 May 2026

Viewed by 121

Abstract

Aspect-Based Sentiment Analysis explores sentiment trends related to specific opinion aspects and holds significant commercial potential for monitoring brand reputation, understanding customer satisfaction, and personalizing recommendations. However, traditional methods rely exclusively on textual input and often struggle when the target aspect is not mentioned in the sentence. Multimodal Aspect-Based Sentiment Analysis addresses this limitation by incorporating both textual and visual modalities to enable more comprehensive sentiment understanding. Despite advancements in deep learning and transformer-based architectures, existing models often suffer from suboptimal modality fusion and weak aspect grounding, limiting their classification accuracy. To overcome these challenges, we propose an Attention-Driven Feature Fusion (ADFF) approach based on a three-stage hierarchical attention mechanism. First, it fuses only the text and image embeddings. Second, it incorporates aspect-level features. Third, a multi-head attention layer further enhances cross-modal dependencies. The resulting representation is passed to a Long Short-Term Memory (LSTM) classifier for sentiment polarity prediction. We evaluate our model on three benchmark datasets, namely Twitter-2015, Twitter-2017, and MASAD. The experimental results demonstrate that the proposed model substantially outperforms state-of-the-art multimodal and unimodal baselines in both accuracy and F1-score. It achieves 82.55% accuracy and 81.05% F1-score on Twitter-2015, 77.07% accuracy and 77.15% F1-score on Twitter-2017, and up to 99.67% accuracy and F1-score in the Plant domain of MASAD, with consistent improvements across all seven MASAD domains. These results highlight the effectiveness and scalability of the hierarchical attention-based fusion strategy for real-world aspect-based sentiment analysis tasks. Full article

28 pages, 7348 KB

Open Access Article

Symbolic Disentangled Representations for Images

by Alexandr V. Korchemnyi, Alexey K. Kovalev and Aleksandr I. Panov

Big Data Cogn. Comput. 2026, 10(6), 168; https://doi.org/10.3390/bdcc10060168 - 22 May 2026

Viewed by 87

Abstract

The idea of disentangled representations is to reduce the data to a set of generative factors that produce it. Typically, such representations are vectors in latent space, where each coordinate corresponds to one of the generative factors. The object can then be modified by changing the value of a particular coordinate, but it is necessary to determine which coordinate corresponds to the desired generative factor—a difficult task if the vector representation has a high dimension. In this article, we propose ArSyD (Architecture for Symbolic Disentanglement), which represents each generative factor as a vector of the same dimension as the resulting representation. In ArSyD, the object representation is obtained as a superposition of the generative factor vector representations. We call such a representation a symbolic disentangled representation. We use the principles of Hyperdimensional Computing (also known as Vector Symbolic Architectures), where symbols are represented as hypervectors, allowing vector operations on them. Disentanglement is achieved by construction, no additional assumptions about the underlying distributions are made during training, and the model is only trained to reconstruct images in a weakly supervised manner. We study ArSyD on the dSprites and CLEVR datasets and provide a comprehensive analysis of the learned symbolic disentangled representations. ArSyD outperforms BetaVAE and FactorVAE baselines on CLEVR1 paired, achieving an FID of 93.72 compared to 129.68 and 115.61, respectively. It also achieves the best IOU value on dSprites paired, at 98.37, compared to 96.43 and 97.11 for the other baselines. We also propose new disentanglement metrics that allow comparison of methods using latent representations of different dimensions. 
ArSyD allows us to edit the object properties in a controlled and interpretable way, and the dimensionality of the object property representation coincides with the dimensionality of the object representation itself. Full article
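The superposition idea described in the abstract can be illustrated with a generic Vector Symbolic Architecture sketch. This is not ArSyD itself (which learns its representations from images); it is a minimal hand-built example assuming random bipolar hypervectors, elementwise multiplication for binding, and summation for bundling. The dimensionality `DIM`, the factor names, and all function names are hypothetical.

```python
import random

DIM = 2048  # hypothetical hypervector dimensionality, chosen for illustration


def random_hv(rng):
    """Draw a random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Bind two hypervectors by elementwise multiplication."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Superpose hypervectors by elementwise summation."""
    return [sum(column) for column in zip(*vectors)]


def similarity(a, b):
    """Normalized dot product between two vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Assumed generative factors ("roles") and their possible values.
roles = {name: random_hv(rng) for name in ("shape", "color")}
values = {name: random_hv(rng) for name in ("square", "circle", "red", "blue")}


def encode(shape, color):
    """Object representation = superposition of role-bound factor vectors."""
    return bundle([bind(roles["shape"], values[shape]),
                   bind(roles["color"], values[color])])


def decode(obj, role, candidates):
    """Unbind a role (bipolar binding is its own inverse) and pick the closest value."""
    query = bind(obj, roles[role])
    return max(candidates, key=lambda name: similarity(query, values[name]))


def edit(obj, role, old, new):
    """Swap one factor value while leaving the other factors untouched."""
    old_term = bind(roles[role], values[old])
    new_term = bind(roles[role], values[new])
    return [o - a + b for o, a, b in zip(obj, old_term, new_term)]
```

Every factor vector has the same dimension as the object vector, so editing a property amounts to subtracting one bound term and adding another, with no need to locate a specific latent coordinate.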

(This article belongs to the Special Issue Artificial Intelligence Models and Cognitive Computing: Innovations from Algorithms to Intelligent Systems)

23 pages, 4279 KB

Open Access Article

Impact of Server-Side Aggregation on Federated Traffic Classification Under Heterogeneous Data Distributions

by Salam Allawi Hussein and Sándor R. Répás

Big Data Cogn. Comput. 2026, 10(6), 167; https://doi.org/10.3390/bdcc10060167 - 22 May 2026

Viewed by 181

Abstract

The growing prevalence of encrypted network traffic has rendered traditional payload-based inspection ineffective, shifting attention toward flow-level statistical analysis combined with machine learning. At the same time, privacy regulations and distributed network architectures make centralised data collection increasingly impractical, motivating federated learning as a privacy-preserving alternative. Despite its promise, deploying federated learning for encrypted traffic classification in realistic environments remains challenging, particularly under heterogeneous client data distributions that arise when different network sites observe different subsets of services. This paper examines how server-side aggregation affects federated QUIC traffic classification under such heterogeneous conditions. We use a five-class Google QUIC dataset and represent each flow with eight statistical features derived from packet size and timing. We compare a centralised baseline with federated learning under three client partitions: mixed-label clients (C1), service-based single-class clients (C2), and hash-based semi-IID clients (C3). For each case, we evaluate four Flower aggregation strategies: FedAvg, FedAdam, FedAvgM, and FedYogi. Results show that client distribution has a greater impact on performance than the choice of aggregation strategy. Federated models match or closely approach centralised performance in C1 and C3, with accuracy up to 0.9969 and macro-AUC near 1.0. In C2, accuracy drops due to extreme label skew, but adaptive aggregation mitigates the effect. FedYogi achieves the best C2 accuracy of 0.9287, while FedAvgM attains the highest C2 macro-AUC of 0.9885. ROC curves and confusion matrices confirm that the choice of aggregation matters mainly under severe heterogeneity. Full article
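Of the four aggregation strategies named above, FedAvg is the simplest to illustrate. The following is a minimal sketch of sample-count-weighted parameter averaging, assuming each client reports a flat list of parameters and its local sample count. It is not the Flower implementation, and the function name `fedavg` is hypothetical.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client update")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must report the same number of parameters")
        for i, value in enumerate(weights):
            # Each client contributes in proportion to its share of the data.
            aggregated[i] += value * size / total
    return aggregated
```

Under extreme label skew, such as the single-class clients in C2, each local update pulls the model toward a single class, which is one intuition for why plain averaging can struggle there compared with the adaptive strategies.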

23 pages, 677 KB

Open Access Article

Large Language Models for Energy Market Analytics: An Exploratory Feasibility Study Across Geopolitical Monitoring, Commodity Summarisation, and Renewable Forecasting

by Alex Krempasky, Erik Kajati and Peter Papcun

Big Data Cogn. Comput. 2026, 10(6), 166; https://doi.org/10.3390/bdcc10060166 - 22 May 2026

Viewed by 194

Abstract

Large Language Models (LLMs) offer opportunities for processing heterogeneous information streams relevant to energy-market decision-making, but their practical role in forecasting-oriented analytical workflows remains uncertain. This paper presents an exploratory feasibility study of LLM use across four energy-market tasks: geopolitical event monitoring for Dutch Title Transfer Facility (TTF) market context using Global Database of Events, Language, and Tone (GDELT)-based data, structured summarisation of commodity-intelligence articles, prompt-engineered solar-power and grid-load forecasting for Austria, and a short-horizon exploratory TTF price-estimation case. The study is positioned as a pilot investigation and hybrid workflow blueprint rather than as a statistically conclusive forecasting benchmark. A four-layer reference architecture was devised, including structured market data, semi-structured news intelligence, web-scraping concepts, and implemented Twitter/X and GDELT monitoring layers. The empirical cases indicate that LLMs are most useful for text-heavy reasoning, event-context integration, source triage, and structured interpretation. In the 20-article summarisation corpus, Gemini 1.5 Pro achieved higher commodity-direction accuracy than GPT-4, while GPT-4 showed stronger output-format stability. In selected solar case checks, OpenAI models produced plausible generation curves close to the Fraunhofer ISE Energy Charts reference, while Energy Charts remained more accurate for aggregate load estimation in the available benchmark comparison. The two-day TTF experiment illustrated that LLMs can incorporate qualitative geopolitical context into short-horizon reasoning, but it did not establish reliable price-forecasting capability. The Twitter/X monitoring layer is retained as a documented negative pathway, showing the limitations of informal social-media scraping for reproducible market intelligence. Full article

(This article belongs to the Special Issue Large Language Models and Their Limitations)

28 pages, 1742 KB

Open Access Article

Domestic Factors Influencing Perceived Interference in Distance Learning: A Machine Learning Approach in Residential Built Environments

by Virginia Puyana-Romero, Angela María Díaz-Márquez, Christiam Santiago Garzón-Pico and Giuseppe Ciaburro

Big Data Cogn. Comput. 2026, 10(5), 165; https://doi.org/10.3390/bdcc10050165 - 19 May 2026

Viewed by 292

Abstract

The shift to online and distance learning, catalyzed by recent health pandemics and social distancing requirements, has significantly changed how teaching occurs and how students experience interference in their learning spaces. New forms of interference have emerged, and they are related to the student’s domestic setting. This study examined how domestic factors influence the interference that students perceive in their online learning spaces, using a Likert-scale questionnaire built around a 29-question predictor variable inventory that also includes two outcome variables addressing the amount of acoustic interference experienced in learning spaces. Using regression models and several machine learning methods, this research aims to identify the key indicators that influence students’ experience of disturbances. The findings highlight the roles that housing density and interactions taking place inside the residence play in these disturbances. Furthermore, the research reveals perceptual inequalities among distance learning students and indicates significant cultural and social consequences related to digital technologies. These insights provide a foundation for addressing the psychosocial challenges and human–building interaction issues that arise in distance learning, and for policies aimed at reducing noise. Full article

38 pages, 511 KB

Open Access Article

Similarity to a Single Set

by Lee Naish

Big Data Cogn. Comput. 2026, 10(5), 164; https://doi.org/10.3390/bdcc10050164 - 19 May 2026

Viewed by 122

Abstract

Identifying similarities in data is fundamental to discovery in science. Measuring or ranking similarity is a key way of reducing the dimensionality of data, is at the heart of many data intensive algorithms and can also be used directly for some applications. This paper extends our understanding of a relatively simple similarity problem. Our primary application is spectral-based fault localisation (SBFL), in which a computer program is run with a large number of test cases and data is collected on which statements are executed in each test case. For each statement, the set of test cases in which it is executed is compared to the set of test cases that failed, and this is used to rank the statements to help locate bugs, an instance of what we call the similarity to a single set (STASS) problem. This paper is primarily theoretical but some contributions are validated with SBFL experiments. Set similarity is equivalent to similarity of binary vectors or two-by-two contingency tables. The problem is also equivalent to converting two-dimensional data with a “partial order”, such as points on a rectangular grid, to a one-dimensional total order. Even when the raw data is not binary, we are often interested in comparing binary classifiers for the data, such as diagnostic tests, and comparing binary classifiers is an instance of the STASS problem. More than a hundred set similarity measures have been proposed in the literature and hundreds of thousands have been evaluated for SBFL, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties and forms of symmetry that similarity measures can have. It refines previously identified properties so they are no longer incompatible, identifies new forms of symmetry, defines ordering relations over similarity measures, and proposes a new statistic that can be used to help choose a good similarity measure for a given domain. Full article
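The SBFL setup described above can be sketched in a few lines. The abstract does not name specific measures, so this example assumes two well-known ones, Jaccard and Ochiai, purely for illustration; the statement labels, test identifiers, and function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Set

Measure = Callable[[Set[int], Set[int]], float]


def jaccard(executed: Set[int], failed: Set[int]) -> float:
    """|E ∩ F| / |E ∪ F|, or 0.0 when both sets are empty."""
    union = executed | failed
    return len(executed & failed) / len(union) if union else 0.0


def ochiai(executed: Set[int], failed: Set[int]) -> float:
    """|E ∩ F| / sqrt(|E| * |F|), or 0.0 when either set is empty."""
    denominator = math.sqrt(len(executed) * len(failed))
    return len(executed & failed) / denominator if denominator else 0.0


def rank_statements(coverage: Dict[str, Set[int]],
                    failed: Set[int],
                    measure: Measure) -> List[str]:
    """Order statements from most to least similar to the failing-test set."""
    return sorted(coverage,
                  key=lambda stmt: measure(coverage[stmt], failed),
                  reverse=True)


# Toy data: statement -> identifiers of the tests that execute it.
coverage = {"s1": {1, 2, 3, 4}, "s2": {3, 4}, "s3": {1, 2}}
failed_tests = {3, 4}
ranking = rank_statements(coverage, failed_tests, jaccard)
```

In this toy case `s2` is executed by exactly the failing tests, so both measures place it first. Different measures can disagree on less clear-cut data, which is the choice problem the paper addresses.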

(This article belongs to the Section Data Mining and Machine Learning)

20 pages, 3279 KB

Open Access Article

The Geometry of Privacy: A Two-Stage Analysis of Generative Membership Inference in Federated Learning

by Borja Arroyo Galende, Patricia A. Apellániz, Alejandro Almodóvar, Silvia Uribe, Federico Álvarez and Juan Parras

Big Data Cogn. Comput. 2026, 10(5), 163; https://doi.org/10.3390/bdcc10050163 - 19 May 2026

Viewed by 179

Abstract

We study Membership Inference Attack (MIA) risk in Federated Learning through a two-stage lens that separates (i) whether a target client’s contribution is detectable after aggregation and system noise (Stage I: Signal Survival) from (ii) whether a surviving contribution induces a generative membership score change attributable to the target’s private data (Stage II: Signal Attribution). Stage I models aggregation as a target–background decomposition and shows that detectability hinges on target–background alignment, which can induce cancellation. Stage II connects the surviving target component to a generative MIA score via a local path representation and Lipschitz/smoothness bounds, avoiding architecture-specific assumptions. Our analysis reveals that the leading attribution term is governed by the alignment between the target update and the score geometry of the target data at an appropriate baseline. We validate the theoretical bounds and illustrate risk trajectories across several scenarios. Full article

26 pages, 396 KB

Open Access Article

Blockchains for Data Management: The DIGI4ECO Use Case and Practical Lessons Beyond Theory

by Andreas Polyvios Delladetsimas, Elias Iosif, Stamatis Papangelou and George Giaglis

Big Data Cogn. Comput. 2026, 10(5), 162; https://doi.org/10.3390/bdcc10050162 - 18 May 2026

Viewed by 209

Abstract

This article examines blockchain as an enabling technological component for data management tasks that are independent of currency-related functionality, a less-discussed aspect of a technology commonly associated with cryptocurrencies and decentralized finance (DeFi). Drawing on empirical findings from the DIGI4ECO project as a case study, we present a structured literature review and cross-domain analysis of blockchain-based data management systems (BDMSs), examine a representative permissioned BDMS implementation, and synthesize practical design guidelines and implementation insights for BDMS development. This perspective is motivated by core blockchain properties such as immutability and transparency, as well as by the observation that existing resources for BDMS development, including methods, tools, and best practices, remain fragmented and less developed than those available for more mature technologies. Full article

(This article belongs to the Section Big Data)

28 pages, 3391 KB

Open Access Article

Enhanced Quantum-Inspired Deep Learning with Multi-Head Attention and Contrastive Learning for Text-Based Dialogue Sentiment Classification

by Fumin Zou, Lei Zou, Feng Guo, Xunhuang Wang, Jianqing Weng, Tao Fang, Haocai Jiang and Xueming Wu

Big Data Cogn. Comput. 2026, 10(5), 161; https://doi.org/10.3390/bdcc10050161 - 18 May 2026

Viewed by 176

Abstract

This study introduces the Quantum-inspired Pretrained Feature Embedding (ImprovedQPFE) model, a framework for dialogue sentiment classification. ImprovedQPFE integrates phase-pretrained complex embeddings, a bidirectional complex-valued GRU, a quantum-inspired attention mechanism, and supervised contrastive learning within a Transformer-based architecture, aiming to enhance feature discriminability under class imbalance. We evaluate ImprovedQPFE on the RECCON-DD and RECCON-IEM benchmarks under a unified and reproducible protocol, including standardized preprocessing and fixed data splits. To ensure reproducibility, all experiments were conducted using a fixed random seed of 42. The reported results are based on this single fixed-seed setting rather than averages over multiple repeated runs. The empirical results show that ImprovedQPFE achieves competitive performance and outperforms the compared baselines under the adopted experimental protocol. On the RECCON-DD dataset, ImprovedQPFE improves Macro-F1 from 80.08% to 83.75% compared with a strong non-quantum Transformer-based baseline equipped with contrastive learning. It also improves Pos-F1 while maintaining high performance for negative classes. On RECCON-IEM, ImprovedQPFE attains a leading Macro-F1 of 95.39% among the compared methods. These findings, together with an ablation analysis, support the effectiveness of the proposed quantum-inspired representation paradigm and its architectural components. However, further statistical validation with multiple repeated runs, standard deviations, confidence intervals, and significance testing remains an important direction for future work. Full article

(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining: 2nd Edition)

29 pages, 25368 KB

Open Access Article

FedX: Privacy-Preserving Explainable Federated Ensemble Intrusion Detection System for Edge-Enabled Internet of Vehicles

by Nithya Nedungadi, Sriram Sankaran and Krishnashree Achuthan

Big Data Cogn. Comput. 2026, 10(5), 160; https://doi.org/10.3390/bdcc10050160 - 16 May 2026

Viewed by 352

Abstract

The evolution from the Internet of Things (IoT) to the Internet of Vehicles (IoV) has expanded intelligent connectivity across embedded systems while increasing cybersecurity risks arising from large-scale data exchange and device heterogeneity. As IoV environments become more dynamic and safety-critical, centralized Intrusion Detection Systems (IDSs) face constraints related to latency, privacy exposure, and bandwidth overhead. These limitations motivate a transition to edge-enabled IoV architectures, where localized vehicular and anchor nodes supported by edge servers enable decentralized processing, enhanced privacy, and reduced communication load. To address these operational challenges, this paper proposes FedX (Federated Explainable Ensemble Intrusion Detection System), a privacy-preserving and explainable federated ensemble IDS that integrates XGBoost and LightGBM models across resource-constrained edge vehicles and roadside units (RSUs) to enable collaborative, low-latency anomaly detection without sharing raw data. By applying adaptive weighting based on model confidence and resource availability, FedX enhances robustness and efficiency while enabling explainable decisions via SHAP and LIME analysis, which highlights reliance on key features (flow duration, speed, RPM) for high-confidence (>97%) intrusion alerts grounded in domain-specific behavior. Privacy is further enforced through Gaussian differential privacy and secure aggregation to mitigate inference and inversion attacks. Experiments on the CICIoV2024 dataset show that FedX achieves 99.1% accuracy, outperforming existing federated ensemble IDS models by up to 2.1%. The system reduces communication overhead by 17% relative to full synchronization through adaptive weighted transmission and secure aggregation. It maintains negligible accuracy loss (<1.5%) under a strong privacy budget (ϵ = 1.1). Deployment of the proposed IDS on a Raspberry Pi 4 underscores its efficacy for edge computing. Experimental results indicate that adaptive weighting yields a 1.8% performance increase, while resource profiling shows 45% lower CPU utilization and over 50% lower power consumption compared with centralized baselines. The findings demonstrate that FedX, combined with explainable AI, enables trustworthy, interpretable, and energy-efficient intrusion detection for secure next-generation edge-enabled IoV networks. Full article

(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)

20 pages, 3389 KB

Open Access Article

Teaching AI to Decode Vaccine Hesitancy Narratives: A Few-Shot Learning and Topic Modeling Approach

by Md Enamul Kabir, Shakhawat H. Tanim, Deanna D. Sellnow, Geneva Lei P. Luteria and Lior Rennert

Big Data Cogn. Comput. 2026, 10(5), 159; https://doi.org/10.3390/bdcc10050159 - 16 May 2026

Viewed by 205

Abstract

Vaccine hesitancy—which can be defined as a delay in acceptance or the refusal to get vaccinated—has substantially increased over the past decade. This study introduces a computational and qualitative approach designed to efficiently classify stance and uncover narratives in social media discourse without relying on extensive manual annotation. Using 298,356 COVID-19 vaccine-related X posts geolocated to South Carolina (June 2021–May 2022), zero-shot and few-shot learning with instruction-tuned large language models (Mistral-7B, Meta-Llama-3.1, and DeepSeek-7B) was applied for stance detection while Latent Dirichlet Allocation (LDA) was used for topic modeling. The topic modeling identified five dominant themes in vaccine hesitant conversations: skepticism of vaccine efficacy, comparative framing, scientific justification, disapproval of regulations, and distrust. Temporal analysis revealed that skepticism peaked during late 2021, coinciding with booster campaigns and mandate debates. These findings suggest that vaccine hesitancy is influenced through complex rhetorical strategies rather than misinformation alone. These underlying narratives often frame skepticism as rational and evidence-based, using scientific language and statistical reasoning to challenge the effectiveness of vaccines. Full article

47 pages, 1590 KB

Open AccessArticle

A Hybrid PoS–PoW Blockchain Framework for Secure Cyber Threat Intelligence Sharing: Design, Implementation, and Evaluation

by Ahmed El-Kosairy and Heba Kamal Aslan

Big Data Cogn. Comput. 2026, 10(5), 158; https://doi.org/10.3390/bdcc10050158 - 15 May 2026

Abstract

Many blockchain-based cyber threat intelligence (CTI) sharing systems emphasize immutability and auditability, but often treat CTI submissions as ordinary blockchain transactions without explicitly separating content validation from publication anchoring. This paper presents CTIB, a proof-of-concept hybrid Proof-of-Stake (PoS) and Proof-of-Work (PoW) framework for CTI publication. CTIB uses a sequential workflow in which a PoS committee first evaluates CTI submissions, and an accepted feed hash is then anchored through a PoW step to provide verifiable temporal binding. The prototype is evaluated in a controlled local Hardhat environment; therefore, the results should be interpreted as prototype-level feasibility evidence rather than production-scale deployment results. CTI content is represented using STIX 2.1, canonicalized, and hashed using SHA-256; only integrity-critical evidence is stored on-chain, while full CTI content remains off-chain. Experimental results demonstrate prototype-level feasibility, with measured throughput, latency, and success rate metrics under different PoW difficulty profiles. Across ten independent local runs, CTIB achieved an average throughput between 141.13 and 166.14 feeds/min, average p50 latency between 326.18 and 403.09 ms, and average p95 latency between 553.22 and 700.82 ms under the tested difficulty profiles. Security analysis uses analytical modeling, committee capture probability, and Monte Carlo simulation to evaluate majority-attack feasibility under stated assumptions. The results indicate that sequential compromise of both validation and anchoring layers increases the cost of coordinated manipulation.
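
The canonicalize, hash, and anchor steps named in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the CTIB implementation: sorted-key compact JSON stands in for whatever canonicalization the authors use, the `hash:nonce` preimage format and the leading-zero-bits difficulty rule are assumed, and the function names are hypothetical.

```python
import hashlib
import json


def canonical_bytes(stix_object: dict) -> bytes:
    """Serialize with sorted keys and compact separators.

    Assumption: a simple stand-in for canonicalization, so that logically
    equal objects always produce identical bytes.
    """
    text = json.dumps(
        stix_object, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def feed_hash(stix_object: dict) -> str:
    """SHA-256 digest (hex) of the canonical form; the value stored on-chain."""
    return hashlib.sha256(canonical_bytes(stix_object)).hexdigest()


def _meets_difficulty(feed_hash_hex: str, nonce: int, difficulty_bits: int) -> bool:
    # Assumed rule: the digest of "hash:nonce" must have at least
    # `difficulty_bits` leading zero bits.
    digest = hashlib.sha256(f"{feed_hash_hex}:{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def anchor(feed_hash_hex: str, difficulty_bits: int) -> int:
    """Brute-force search for the smallest nonce that satisfies the difficulty."""
    nonce = 0
    while not _meets_difficulty(feed_hash_hex, nonce, difficulty_bits):
        nonce += 1
    return nonce


def verify_anchor(feed_hash_hex: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap check that a claimed nonce anchors the given feed hash."""
    return _meets_difficulty(feed_hash_hex, nonce, difficulty_bits)
```

Because only the 64-character hash and the nonce need to be recorded, the full CTI object can stay off-chain while anyone holding it can recompute `feed_hash` and call `verify_anchor` to confirm it matches the anchored record.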

18 pages, 1508 KB

Open AccessArticle

PRL-DAS: Robust Heliox Speech Recognition for Unaligned Low-Resource Data

by Yonghong Chen, Guoqi Zhang, Wanzhi Wen and Shibing Zhang

Big Data Cogn. Comput. 2026, 10(5), 157; https://doi.org/10.3390/bdcc10050157 - 15 May 2026

Abstract

Speech produced in helium–oxygen (heliox) environments in deep saturation diving exhibits pronounced spectral shifts and temporal distortions, which severely degrade automatic speech recognition (ASR) systems trained on normal-air corpora. Existing studies often adopt a restoration-then-recognition paradigm by training waveform mapping networks on paired heliox/air recordings. However, in realistic low-resource data collection, paired recordings are typically obtained by independent re-reading and are therefore not strictly time-aligned, which makes regression-style restoration more sensitive to pairing errors and increases the risk of front-end distortions. This paper proposes a robust recognition framework for heliox speech, termed PRL-DAS (Physics-informed Resampling and LoRA with Duration-Adaptive Speed). The framework consists of a physics-inspired linear resampling warm start (PhysSpeed), parameter-efficient Low-Rank Adaptation (LoRA), and duration-adaptive speed (DAS) inference enhancement. Specifically, we first apply physics-motivated linear resampling as a coarse warm start, and then perform mixed-domain LoRA fine-tuning of a Whisper foundation model to absorb residual non-linear differences. On a corpus of 1048 paired Chinese heliox utterances under leave-one-speaker-out (LOSO) evaluation, using Whisper-Medium as the base model, PhysSpeed followed by mixed-domain LoRA reduces the overall character error rate (CER) from 49.33% with PhysSpeed preprocessing only to 25.79%, while also improving performance on the normal domain. Furthermore, the full PRL-DAS framework applies Soft-DAS, a lightweight smooth schedule motivated by duration-dependent variation in the optimal resampling factor, and further reduces the overall CER to 24.37% without additional training cost.

(This article belongs to the Section Data Mining and Machine Learning)
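
The character error rate (CER) reported in the abstract above is conventionally the character-level edit distance divided by the number of reference characters. A minimal sketch follows, assuming errors are pooled over the whole corpus before dividing; the abstract does not say how the authors aggregate across utterances or normalize text, and the function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: pooled aggregation; per-utterance averaging would give a
    different number on the same data.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Working at the character level suits Chinese transcripts, where word boundaries are not marked; note that CER can exceed 1.0 when a hypothesis contains many insertions.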
