- Lean Analytics and Industry 6.0 for Antifragile and Generative AI-Orchestrated Manufacturing Ecosystems
- Assessing Scalability and Transfer Learning in Urban Scene Segmentation with Explainable AI
- Big Data Analytics and AI for Consumer Behavior in Digital Marketing
- Explaining Complex Agent-Based Simulations as Text via Large Language Models
- Interpretable Optimized XGBoost for Predicting Higher Heating Value from Coal Elemental Composition in Energy Conversion
Journal Description
Big Data and Cognitive Computing is an international, peer-reviewed, open access journal on big data and cognitive computing published monthly online by MDPI.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), dblp, Inspec, Ei Compendex, and other databases.
- Journal Rank: JCR - Q1 (Computer Science, Theory and Methods) / CiteScore - Q1 (Computer Science Applications)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 23.1 days after submission; acceptance to publication is undertaken in 4.6 days (median values for papers published in this journal in the second half of 2025).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
- Journal Cluster of Artificial Intelligence: AI, AI in Medicine, Algorithms, BDCC, MAKE, MTI, Stats, Virtual Worlds and Computers.
Impact Factor: 4.4 (2024); 5-Year Impact Factor: 4.2 (2024)
Latest Articles
Blockchains for Data Management: The DIGI4ECO Use Case and Practical Lessons Beyond Theory
Big Data Cogn. Comput. 2026, 10(5), 162; https://doi.org/10.3390/bdcc10050162 - 18 May 2026
Abstract
This article examines blockchain as an enabling technological component for data management tasks that are independent of currency-related functionality, a less-discussed aspect of a technology commonly associated with cryptocurrencies and decentralized finance (DeFi). Drawing on empirical findings from the DIGI4ECO project as a case study, we present a structured literature review and cross-domain analysis of blockchain-based data management systems (BDMSs), examine a representative permissioned BDMS implementation, and synthesize practical design guidelines and implementation insights for BDMS development. This perspective is motivated by core blockchain properties such as immutability and transparency, as well as by the observation that existing resources for BDMS development, including methods, tools, and best practices, remain fragmented and less developed than those available for more mature technologies.
(This article belongs to the Section Big Data)
Open Access Article
Enhanced Quantum-Inspired Deep Learning with Multi-Head Attention and Contrastive Learning for Text-Based Dialogue Sentiment Classification
by
Fumin Zou, Lei Zou, Feng Guo, Xunhuang Wang, Jianqing Weng, Tao Fang, Haocai Jiang and Xueming Wu
Big Data Cogn. Comput. 2026, 10(5), 161; https://doi.org/10.3390/bdcc10050161 - 18 May 2026
Abstract
This study introduces the Quantum-inspired Pretrained Feature Embedding (ImprovedQPFE) model, a framework for dialogue sentiment classification. ImprovedQPFE integrates phase-pretrained complex embeddings, a bidirectional complex-valued GRU, a quantum-inspired attention mechanism, and supervised contrastive learning within a Transformer-based architecture, aiming to enhance feature discriminability under class imbalance. We evaluate ImprovedQPFE on the RECCON-DD and RECCON-IEM benchmarks under a unified and reproducible protocol, including standardized preprocessing and fixed data splits. To ensure reproducibility, all experiments were conducted using a fixed random seed of 42. The reported results are based on this single fixed-seed setting rather than averages over multiple repeated runs. The empirical results show that ImprovedQPFE achieves competitive performance and outperforms the compared baselines under the adopted experimental protocol. On the RECCON-DD dataset, ImprovedQPFE improves Macro-F1 from 80.08% to 83.75% compared with a strong non-quantum Transformer-based baseline equipped with contrastive learning. It also improves Pos-F1 while maintaining high performance for negative classes. On RECCON-IEM, ImprovedQPFE attains a leading Macro-F1 of 95.39% among the compared methods. These findings, together with an ablation analysis, support the effectiveness of the proposed quantum-inspired representation paradigm and its architectural components. However, further statistical validation with multiple repeated runs, standard deviations, confidence intervals, and significance testing remains an important direction for future work.
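Macro-F1, the headline metric in the abstract above, is the unweighted mean of per-class F1 scores, so a minority class counts as much as a majority class. The following is a minimal sketch in plain Python; the function names and any labels passed to them are illustrative assumptions and are not taken from the paper.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```

Because every class contributes equally to the mean, a gain on a rare class moves Macro-F1 more than it would move plain accuracy.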
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining: 2nd Edition)
Open Access Article
FedX: Privacy-Preserving Explainable Federated Ensemble Intrusion Detection System for Edge-Enabled Internet of Vehicles
by
Nithya Nedungadi, Sriram Sankaran and Krishnashree Achuthan
Big Data Cogn. Comput. 2026, 10(5), 160; https://doi.org/10.3390/bdcc10050160 - 16 May 2026
Abstract
The evolution from the Internet of Things (IoT) to the Internet of Vehicles (IoV) has expanded intelligent connectivity across embedded systems while increasing cybersecurity risks arising from large-scale data exchange and device heterogeneity. As IoV environments become more dynamic and safety-critical, centralized Intrusion Detection Systems (IDSs) face constraints related to latency, privacy exposure, and bandwidth overhead. These limitations motivate a transition to edge-enabled IoV architectures, where localized vehicular and anchor nodes supported by edge servers enable decentralized processing, enhanced privacy, and reduced communication load. To address these operational challenges, this paper proposes FedX (Federated Explainable Ensemble Intrusion Detection System), a privacy-preserving and explainable federated ensemble IDS that integrates XGBoost and LightGBM models across resource-constrained edge vehicles and roadside units (RSUs) to enable collaborative, low-latency anomaly detection without sharing raw data. By applying adaptive weighting based on model confidence and resource availability, FedX enhances robustness and efficiency while enabling explainable decisions via SHAP and LIME analysis, which highlights reliance on key features (flow duration, speed, RPM) for high-confidence (>97%) intrusion alerts grounded in domain-specific behavior. Privacy is further enforced through Gaussian differential privacy and secure aggregation to mitigate inference and inversion attacks. Experiments on the CICIoV2024 dataset show that FedX achieves 99.1% accuracy, outperforming existing federated ensemble IDS models by up to 2.1%. The system reduces communication overhead by 17% relative to full synchronization through adaptive weighted transmission and secure aggregation. It maintains negligible accuracy loss (<1.5%) under a strong privacy budget (ε = 1.1). Deployment of the proposed IDS on a Raspberry Pi 4 underscores its efficacy for edge computing. Experimental results indicate that adaptive weighting yields a 1.8% performance increase, while resource profiling shows 45% lower CPU utilization and over 50% lower power consumption compared with centralized baselines. The findings demonstrate that FedX, combined with explainable AI, enables trustworthy, interpretable, and energy-efficient intrusion detection for secure next-generation edge-enabled IoV networks.
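The abstract says that ensemble members are weighted by model confidence and resource availability but does not give the formula. As a rough sketch of what such a scheme could look like, the Python below assumes each node's weight is proportional to the product of a confidence score and a resource score, both in [0, 1]. That rule, the function names, and the parameters are assumptions made for illustration, not FedX's actual method.

```python
def adaptive_weights(confidences, resources):
    """Normalize confidence * resource products into weights that sum to 1.

    Assumed rule (not from the paper): weight_i is proportional to
    confidence_i * resource_i. Falls back to uniform weights if all are zero.
    """
    raw = [c * r for c, r in zip(confidences, resources)]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)
    return [x / total for x in raw]


def ensemble_probability(probs, weights):
    """Weighted average of per-node attack probabilities."""
    return sum(w * p for w, p in zip(weights, probs))
```

Under this assumed rule, a node that is confident but short on resources contributes less than an equally confident node with spare capacity.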
(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)
Open Access Article
Teaching AI to Decode Vaccine Hesitancy Narratives: A Few-Shot Learning and Topic Modeling Approach
by
Md Enamul Kabir, Shakhawat H. Tanim, Deanna D. Sellnow, Geneva Lei P. Luteria and Lior Rennert
Big Data Cogn. Comput. 2026, 10(5), 159; https://doi.org/10.3390/bdcc10050159 - 16 May 2026
Abstract
Vaccine hesitancy, which can be defined as a delay in acceptance or the refusal to get vaccinated, has substantially increased over the past decade. This study introduces a computational and qualitative approach designed to efficiently classify stance and uncover narratives in social media discourse without relying on extensive manual annotation. Using 298,356 COVID-19 vaccine-related X posts geolocated to South Carolina (June 2021–May 2022), zero-shot and few-shot learning with instruction-tuned large language models (Mistral-7B, Meta-Llama-3.1, and DeepSeek-7B) was applied for stance detection while Latent Dirichlet Allocation (LDA) was used for topic modeling. The topic modeling identified five dominant themes in vaccine-hesitant conversations: skepticism of vaccine efficacy, comparative framing, scientific justification, disapproval of regulations, and distrust. Temporal analysis revealed that skepticism peaked during late 2021, coinciding with booster campaigns and mandate debates. These findings suggest that vaccine hesitancy is influenced through complex rhetorical strategies rather than misinformation alone. These underlying narratives often frame skepticism as rational and evidence-based, using scientific language and statistical reasoning to challenge the effectiveness of vaccines.
Open Access Article
A Hybrid PoS–PoW Blockchain Framework for Secure Cyber Threat Intelligence Sharing: Design, Implementation, and Evaluation
by
Ahmed El-Kosairy and Heba Kamal Aslan
Big Data Cogn. Comput. 2026, 10(5), 158; https://doi.org/10.3390/bdcc10050158 - 15 May 2026
Abstract
Many blockchain-based cyber threat intelligence (CTI) sharing systems emphasize immutability and auditability, but often treat CTI submissions as ordinary blockchain transactions without explicitly separating content validation from publication anchoring. This paper presents CTIB, a proof-of-concept hybrid Proof-of-Stake (PoS) and Proof-of-Work (PoW) framework for CTI publication. CTIB uses a sequential workflow in which a PoS committee first evaluates CTI submissions, and an accepted feed hash is then anchored through a PoW step to provide verifiable temporal binding. The prototype is evaluated in a controlled local Hardhat environment; therefore, the results should be interpreted as prototype-level feasibility evidence rather than production-scale deployment results. CTI content is represented using STIX 2.1, canonicalized, and hashed using SHA-256; only integrity-critical evidence is stored on-chain, while full CTI content remains off-chain. Experimental results demonstrate prototype-level feasibility, with measured throughput, latency, and success rate metrics under different PoW difficulty profiles. Across ten independent local runs, CTIB achieved an average throughput between 141.13 and 166.14 feeds/min, average p50 latency between 326.18 and 403.09 ms, and average p95 latency between 553.22 and 700.82 ms under the tested difficulty profiles. Security analysis uses analytical modeling, committee capture probability, and Monte Carlo simulation to evaluate majority-attack feasibility under stated assumptions. The results indicate that sequential compromise of both validation and anchoring layers increases the cost of coordinated manipulation.
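The hash-then-anchor workflow described above can be illustrated with the Python standard library. The abstract does not specify the canonicalization scheme or the PoW puzzle, so this sketch assumes sorted-key compact JSON as the canonical form and a hash-prefix puzzle with `difficulty` leading zero hex digits. The function names and the `feed_hash:nonce` layout are hypothetical.

```python
import hashlib
import json


def canonical_hash(cti_object):
    """SHA-256 over an assumed canonical form: sorted keys, compact separators."""
    canonical = json.dumps(
        cti_object, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def anchor(feed_hash, difficulty):
    """Find a nonce so that SHA-256(feed_hash:nonce) starts with `difficulty` zeros.

    Illustrative puzzle only; the paper's PoW step may be defined differently.
    """
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{feed_hash}:{nonce}".encode("utf-8")).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1
```

Only the resulting digest and nonce would need to be stored on-chain; the full CTI object can stay off-chain and be re-hashed by anyone who wants to verify it.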
Open Access Article
PRL-DAS: Robust Heliox Speech Recognition for Unaligned Low-Resource Data
by
Yonghong Chen, Guoqi Zhang, Wanzhi Wen and Shibing Zhang
Big Data Cogn. Comput. 2026, 10(5), 157; https://doi.org/10.3390/bdcc10050157 - 15 May 2026
Abstract
Speech produced in helium–oxygen (heliox) environments in deep saturation diving exhibits pronounced spectral shifts and temporal distortions, which severely degrade automatic speech recognition (ASR) systems trained on normal-air corpora. Existing studies often adopt a restoration-then-recognition paradigm by training waveform mapping networks on paired heliox/air recordings. However, in realistic low-resource data collection, paired recordings are typically obtained by independent re-reading and are therefore not strictly time-aligned, which makes regression-style restoration more sensitive to pairing errors and increases the risk of front-end distortions. This paper proposes a robust recognition framework for heliox speech, termed PRL-DAS (Physics-informed Resampling and LoRA with Duration-Adaptive Speed). The framework consists of a physics-inspired linear resampling warm start (PhysSpeed), parameter-efficient Low-Rank Adaptation (LoRA), and duration-adaptive speed (DAS) inference enhancement. Specifically, we first apply physics-motivated linear resampling as a coarse warm start, and then perform mixed-domain LoRA fine-tuning of a Whisper foundation model to absorb residual non-linear differences. On a corpus of 1048 paired Chinese heliox utterances under leave-one-speaker-out (LOSO) evaluation, using Whisper-Medium as the base model, PhysSpeed followed by mixed-domain LoRA reduces the overall character error rate (CER) from 49.33% with PhysSpeed preprocessing only to 25.79%, while also improving performance on the normal domain. Furthermore, the full PRL-DAS framework applies Soft-DAS, a lightweight smooth schedule motivated by duration-dependent variation in the optimal resampling factor, and further reduces the overall CER to 24.37% without additional training cost.
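The character error rate (CER) reported above is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. A small self-contained sketch follows. The function names are illustrative, and the paper may apply text normalization before scoring that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many insertions.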
(This article belongs to the Section Data Mining and Machine Learning)
Open Access Article
A Three-Tier Hybrid Architecture for an Admissions Dialogue Assistant with Graph-Aware Context Routing
by
Nikita Stepanov, Anastasiya Radaeva, Peter Panfilov, Alexander Suleykin and Valery Pyatetsky
Big Data Cogn. Comput. 2026, 10(5), 156; https://doi.org/10.3390/bdcc10050156 - 15 May 2026
Abstract
University admissions services must answer large volumes of applicant questions that differ substantially in complexity, ranging from repetitive FAQ-type requests to multi-step questions involving programs, entrance exams, admission rules, passing scores, and temporal comparisons. Ungrounded large language model responses are risky in this domain because answers must be factually correct, source-based, and consistent with official institutional data. This paper presents a three-tier hybrid architecture for an admissions dialogue assistant that combines deterministic FAQ matching, hybrid retrieval-augmented generation, and graph-grounded retrieval for complex queries. The first tier, Hash-FAQ, returns verified answers for frequent intents using normalized keys, hash-based lookup, near-duplicate fingerprinting, and semantic similarity checks. The second tier applies hybrid RAG based on BM25 retrieval, vector search, rank fusion, and optional cross-encoder reranking. The third tier uses GraphRAG to extract a constrained k-hop subgraph from a Neo4j knowledge graph built from relational admissions data and document-derived facts. All tiers are synchronized through a versioned indexing pipeline with shadow collections and atomic switching across lexical, vector, FAQ, relational, and graph stores. The system was evaluated using real admissions-campaign traffic and a labeled subset of applicant queries. Tier 1 resolved 68.7% of requests with low latency, while the GraphRAG branch improved factual accuracy with attribution on multi-step queries from 0.55 to 0.91 compared with the non-graph baseline. The main contribution of the study is a production-oriented, cost-aware retrieval-and-generation architecture that links tiered routing, synchronized knowledge publication, source attribution, and operational evaluation for applicant-facing institutional dialogue systems.
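The second tier merges a BM25 ranking with a vector-search ranking through rank fusion. The abstract does not name the fusion rule, so the sketch below uses reciprocal rank fusion (RRF), one common choice, purely as an assumption. The function name, the document identifiers, and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF uses only ranks, not raw scores, so BM25 scores and vector similarities do not need to be calibrated against each other before fusion.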
(This article belongs to the Topic Electronic Communications, IOT and Big Data, 2nd Volume)
Open Access Article
Performance Trade-Offs of Optimizing Small Language Models for E-Commerce
by
Josip Tomo Licardo, Nikola Tanković, Ivan Osman, Ivan Lorencin and Sandi Baressi Šegota
Big Data Cogn. Comput. 2026, 10(5), 155; https://doi.org/10.3390/bdcc10050155 - 14 May 2026
Abstract
Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 98.8% accuracy, approaching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to in inference throughput and up to a 72% reduction in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not just a viable but a more suitable alternative for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
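To make the dequantization overhead mentioned above concrete, the sketch below shows symmetric round-to-nearest 4-bit quantization of a weight vector. This is a deliberately simplified stand-in: GPTQ and the GGUF formats use more elaborate schemes, and the function names here are illustrative assumptions.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-7, 7] that share one scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats; this multiply is extra work at inference time."""
    return [c * scale for c in codes]
```

Storing the small integer codes is what saves memory. The multiply in `dequantize` has to run before the weights can be used in floating-point arithmetic, which is the kind of added cost the abstract attributes to older GPU architectures.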
Open Access Article
Consensus-Driven Framework for Data-Driven Optimization of Distributed Systems Through Blockchain Consensus Mechanism Selection
by
Miljenko Švarcmajer, Mirko Kohler, Zdravko Krpić and Ivica Lukić
Big Data Cogn. Comput. 2026, 10(5), 154; https://doi.org/10.3390/bdcc10050154 - 13 May 2026
Abstract
Modern data-driven distributed systems increasingly rely on blockchain technologies to ensure trust, transparency, and decentralized coordination. However, the rapid proliferation of consensus mechanisms has created a complex design space, making the selection of an appropriate protocol a non-trivial architectural and decision-making challenge. Different consensus mechanisms rely on distinct security resources, validator admission models, and agreement architectures, leading to diverse trade-offs between scalability, decentralization, performance, and governance. Existing studies primarily focus on classification or performance comparison of consensus mechanisms, while the problem of systematic, requirement-driven selection remains insufficiently addressed. In particular, there is a lack of structured approaches that integrate multiple system requirements into a unified decision framework suitable for real-world environments. To address this gap, this paper proposes a consensus-driven, layered framework for blockchain consensus mechanism selection, formulated as a multi-criteria decision problem. The framework organizes the consensus design space across key architectural dimensions and analyzes 32 consensus mechanisms, enabling systematic comparison and supporting data-driven decision-making. The approach is further demonstrated through five representative use-case scenarios, showing its applicability in optimizing distributed system design.
(This article belongs to the Special Issue AI and Blockchain for Trustworthy Social Computing)
Open Access Article
A Dueling DQN-Based Hyper-Heuristic Framework for Learning Path Optimization
by
Yong-Wei Zhang, Ming-Yang Zhu, Wen-Kai Xia, Xin-Yang Zhang and Jin-Di Liu
Big Data Cogn. Comput. 2026, 10(5), 153; https://doi.org/10.3390/bdcc10050153 - 13 May 2026
Abstract
Learning path optimization is crucial in intelligent educational systems, with the core challenge of efficient multi-objective sequential decision-making under complex prerequisite constraints. To address the poor generalization of existing methods relying on fixed operator scheduling or handcrafted heuristics, this paper proposes a hyper-heuristic framework based on Dueling Deep Q-Network (Dueling DQN-HH), formulating operator selection as a sequential decision-making process for dynamic adaptive scheduling of low-level operators. The framework adopts priority-based encoding to unify learning path representation (decoupling the hyper-heuristic layer from the problem domain) and designs a composite reward mechanism integrating reward shaping, exploration incentives, and computational cost awareness to balance solution quality and efficiency. Additionally, it employs a dueling network architecture with prioritized experience replay to enhance policy learning stability. Experimental results show the proposed method outperforms representative baseline algorithms in solution quality, convergence stability, and computational efficiency. The framework demonstrates superior performance across multiple objectives, particularly in minimizing the total learning time (Ftime), as validated on two heterogeneous datasets: MOOCCube (Computer Science) and PsyDataset (Psychology). Further ablation studies and operator evolution analyses verify its adaptive scheduling capability under different objectives and knowledge graph structures, demonstrating strong objective independence and cross-dataset generalization.
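A dueling network splits its output into a state value and per-action advantages and then recombines them into Q-values. The sketch below shows the standard mean-subtracted aggregation and a greedy choice over low-level operators. The function names are illustrative, and the paper's network and its exact aggregation are not given in the abstract.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def select_operator(q_values):
    """Greedy choice: index of the low-level operator with the highest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

Subtracting the mean advantage makes the decomposition identifiable: without it, a constant could be shifted between the value and advantage streams without changing Q.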
(This article belongs to the Section Data Mining and Machine Learning)
Open Access Article
Data-Driven Peak Demand Identification in Commercial Electricity Consumption for Load Curve Flattening
by
Michał Gostkowski, Tomasz Ząbkowski and Krzysztof Gajowniczek
Big Data Cogn. Comput. 2026, 10(5), 152; https://doi.org/10.3390/bdcc10050152 - 12 May 2026
Abstract
Effective peak load management enables utilities to mitigate increased electricity demand and optimize the use of available resources during periods of maximum consumption. Accurate forecasting of the peak load is essential for ensuring the reliability, efficiency, and resilience of contemporary power systems. In this study, commercial customer-level data were employed to identify electricity peak demand within the Polish power system, drawing upon historical records of both energy consumption and meteorological variables. Departing from conventional time series forecasting approaches, the problem was intentionally reformulated as a pattern recognition task. Three classification techniques were systematically evaluated to identify individual customers’ peak load events, thereby offering a basis for demand-side management strategies and incentive mechanisms aimed at flattening load profiles and improving grid stability. The proposed approach demonstrates how data-driven analytics can support utilities in extracting actionable knowledge from large-scale energy datasets and enabling proactive demand response programs. Empirical results indicate that the proposed methods are capable of predicting up to 90% of electricity peak occurrences, with a forecasting horizon of 24 h leading to significant shifts in the load curve.
(This article belongs to the Section Data Mining and Machine Learning)
Open Access Article
A Comparative Evaluation of AI Approaches to Large-Scale Scientific Subject Classification
by
Roland Tanácsi and András Micsik
Big Data Cogn. Comput. 2026, 10(5), 151; https://doi.org/10.3390/bdcc10050151 - 11 May 2026
Abstract
Background: The Hungarian Science Bibliography applies the OECD Frascati Fields of Science and Technology taxonomy for subject classification; however, approximately 80% of its records lack assigned categories. Automated large-scale classification could support retrospective completion and improve the quality of bibliographic data. Methods: We evaluated multiple artificial intelligence approaches to classifying publications into level 4 Frascati categories using only titles and keywords. Training datasets were compiled from bibliographic records and subjected to heuristic and large-language-model-based filtering to reduce noise and ambiguity. The approaches tested included statistical methods, classical machine learning classifiers, fine-tuned SciBERT models, zero-shot prompting with large language models, and a Mixture-of-Experts architecture. Results: Data quality had a stronger impact on performance than model complexity. Large-language-model-based filtering substantially improved classification results. The best-performing model, a Support Vector Classifier, achieved a weighted F1 score of 0.83, which is an outstanding result relative to state-of-the-art approaches from the literature. Conclusions: Our findings contribute new insights into classification research and may assist others in selecting appropriate solutions for real-world, large-scale bibliographic classification tasks.
(This article belongs to the Topic Generative AI and Interdisciplinary Applications)
Open Access Article
BDERL: A Reinforcement Learning-Enhanced Differential Evolution for the Earliness–Tardiness RCPSP
by
Hao Nguyen Thi, Loc Nguyen The and Huu Dang Quoc
Big Data Cogn. Comput. 2026, 10(5), 150; https://doi.org/10.3390/bdcc10050150 - 11 May 2026
Abstract
This paper introduces the ETMS-RCPSP (Earliness–Tardiness Multi-Skill Resource-Constrained Scheduling Problem), a novel problem derived from the MS-RCPSP by adding constraints on project completion time or actual production contracts. The goal of the new problem is to control the project completion time so that it matches reality as closely as possible; this differs from the original MS-RCPSP, which aimed to minimize project execution time. The objective of the problem is of greater practical significance in ensuring project completion on schedule while also addressing related issues, such as the ability to receive finished products on time as stipulated in the contract. The ETMS-RCPSP is an NP-hard problem whose result can be used for resource allocation in project execution or for resource arrangement in production lines to fulfill economic contracts. To address the ETMS-RCPSP, the paper proposes a new evolutionary algorithm, BDERL (Balanced Differential Evolution with Reinforcement Learning), that combines differential evolution with a problem-specific decoding mechanism and an adaptive parameter control strategy based on reinforcement learning (Q-learning). The proposed algorithm is evaluated on benchmark instances derived from the iMOPSE dataset and the TNG company dataset, a real-world dataset from manufacturing and contract-driven environments. Experimental results demonstrate that the approach consistently reduces total production costs compared to baseline heuristics while maintaining competitive computational efficiency. The findings underscore the efficacy of adaptive hybrid optimization techniques in solving intricate production scheduling problems characterized by limited resources and varied skill competencies.
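The abstract does not state the objective function. As a hedged illustration of what an earliness-tardiness objective usually looks like in scheduling, the sketch below penalizes deviation of the completion time from a due date in both directions. The function name, the linear form, and the weights are assumptions and not the paper's formulation.

```python
def earliness_tardiness_cost(completion, due, early_weight=1.0, tardy_weight=1.0):
    """Linear penalty for finishing before or after the due date.

    Assumed form: early_weight * max(0, due - completion)
                + tardy_weight * max(0, completion - due).
    """
    earliness = max(0, due - completion)
    tardiness = max(0, completion - due)
    return early_weight * earliness + tardy_weight * tardiness
```

Unlike a makespan objective, this cost is not minimized by finishing as early as possible. A schedule that finishes exactly on the due date scores zero, and the two weights let lateness be penalized more heavily than earliness.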
(This article belongs to the Special Issue Smart Manufacturing in the AI Era)
Open Access Article
MDA-Net: A Segmentation Network for Kidney Tumor Based on Enhanced Multi-Scale Feature Extraction and Attention Refinement
by
Shaofu Lin, Yumiao Chang, Jianhui Chen and Lianfang Ma
Big Data Cogn. Comput. 2026, 10(5), 149; https://doi.org/10.3390/bdcc10050149 - 8 May 2026
Abstract
Accurate kidney tumor segmentation from abdominal CT is essential for quantitative assessment and treatment planning. However, indistinct tumor boundaries and substantial inter-patient shape variability render traditional hand-crafted feature-based methods unreliable for precise delineation. Although deep learning has advanced this task, these methods still struggle with multi-scale tumor characteristics, complex morphological variations, and background noise in medical images. To address these challenges, we propose MDA-Net, an end-to-end segmentation method based on enhanced multi-scale feature extraction and attention refinement. Specifically, we introduce a Multi-Scale Feature Extraction (MSFE) module into encoder–decoder skip connections to aggregate dilated features across multiple receptive fields and learn branch-wise weights for adaptive refinement and fusion, thereby enhancing boundary details and semantic cues to reduce tumor-tissue ambiguity. At the bottleneck, a Deformable Pyramid Feature Refinement (DPFR) module combines deformable sampling with pyramid contextual modeling, thereby improving adaptability to variations in tumor shape and scale while preserving feature resolution. Moreover, a Channel and Spatial Attention (CASA) module is embedded in the decoder to suppress background interference and enhance boundary-sensitive structures during upsampling via coordinated channel and spatial reweighting, thereby improving the reconstruction of fine-grained tumor morphology and contours. Experiments on both KiTS19 and KiTS21 show that MDA-Net consistently improves tumor boundary delineation, lesion localization, and mask reconstruction, demonstrating stronger robustness and cross-dataset generalizability than representative baseline methods. Ablation studies further confirm the complementary effects of MSFE, DPFR, and CASA. In addition, Grad-CAM visualizations improve the clinical transparency and interpretability of the model. 
Overall, this method advances deep learning for medical image analysis and supports precise diagnosis and treatment of renal tumors.
(This article belongs to the Special Issue Deep Learning for Advanced Visual Representation and Analysis)
Open Access Article
Adaptive Neural Network System for Preventing Violations of Personal Digital Rights as a National Security Factor
by
Serhii Vladov, Oksana Mulesa, Maryana Marusinets, Tiberiy Chegi, Victoria Vysotska, Anton Kazakov, Iryna Kirieieva, Maksym Korniienko and Tetiana Morhunova
Big Data Cogn. Comput. 2026, 10(5), 148; https://doi.org/10.3390/bdcc10050148 - 8 May 2026
Abstract
The article develops a hybrid multimodal neural network for the automatic prevention of personal digital rights violations, focusing on improving security through anomaly detection and ensuring data confidentiality. The main aim is to integrate several methods, such as federated learning, gating, latent competitive learning, and a variational autoencoder, to improve violation detection accuracy. The key contribution is the development of a training mixture that combines a probabilistic anomaly detector and an autoencoder reconstruction signal, which allows for effective detection of typical incidents and hidden anomalies. The experimental evaluation showed strong performance, with a ROC-AUC of 0.96 and an accuracy of 0.94, confirming the system’s effectiveness on anonymized data. The results are of practical significance: they can be integrated into national information security systems, including SOC and forensic reports, to ensure a higher level of personal data protection and reduce privacy breach risks. The scope of the proposed system covers cybersecurity, personal data protection, national security, SOC systems, and forensic analysis.
(This article belongs to the Special Issue Internet Intelligence for Cybersecurity)
Open Access Article
Federated Learning-Based Adaptive Multi-Head Attention Model for Wind Power Forecasting
by
Yihua Zhu, Chao Luo, Ke Wu, Jiawei Yu, Binjiang Hu, Lei Huang and Bitao Xiao
Big Data Cogn. Comput. 2026, 10(5), 147; https://doi.org/10.3390/bdcc10050147 - 7 May 2026
Abstract
Enhancing the accuracy of short-term wind power forecasting helps mitigate the adverse impacts of prediction errors on grid dispatch. Wind power exhibits a significantly nonlinear dependence on multiple influencing factors. However, existing methods struggle to effectively resolve multi-dimensional feature redundancy and multi-scale non-stationary evolutionary characteristics inherent in far-offshore wind power forecasting tasks. This leads to bottlenecks such as insufficient feature discriminability and temporal dependency focus shift under complex marine environments, ultimately limiting further improvements in prediction accuracy. To address these challenges, this paper proposes a federated learning-based adaptive multi-head attention model for wind power forecasting (Fed-AMHA). The proposed framework operates as follows: First, each wind farm client utilizes a Bidirectional Long Short-Term Memory (BiLSTM) network to model input sequences bidirectionally, capturing long-term temporal dependencies. Subsequently, linear projection and parallel one-dimensional convolution operations are introduced to mine multi-scale local temporal features from each time step and its neighborhood. Building upon this, channel attention and multi-head temporal feature attention mechanisms are stacked. The model adaptively adjusts the weights of different time slices and feature channels by learning the importance of each channel to the forecasting task. The central server then aggregates the model parameters uploaded by the clients via averaging, enabling cross-site collaborative training without directly sharing raw data. Simulation results based on public datasets and actual wind farm data under various short-term forecasting scenarios demonstrate that the proposed model consistently achieves lower prediction errors and superior stability compared to existing forecasting models under identical settings.
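The server-side aggregation step described above, averaging the parameters uploaded by the clients, can be sketched as follows. This is a minimal illustration that assumes an unweighted mean and represents each client's parameters as a dictionary of flat float lists; the abstract does not say whether the average is weighted by client data size, and the names here are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical layout: parameter name -> flat weights


def average_parameters(client_params: List[Params]) -> Params:
    """Element-wise unweighted mean of the parameters uploaded by each client."""
    if not client_params:
        raise ValueError("at least one client is required")
    n_clients = len(client_params)
    averaged: Params = {}
    for name in client_params[0]:
        # Collect the same parameter from every client, then average per position.
        vectors = [params[name] for params in client_params]
        averaged[name] = [sum(values) / n_clients for values in zip(*vectors)]
    return averaged


if __name__ == "__main__":
    farm_a = {"w": [1.0, 2.0], "b": [0.0]}
    farm_b = {"w": [3.0, 4.0], "b": [1.0]}
    print(average_parameters([farm_a, farm_b]))
```

Only the parameter values cross the network in this scheme; the raw wind farm measurements stay with each client, which is the property the abstract emphasizes.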
Open Access Review
Talent Identification and AI-Driven Decision Tools in Sport: A Policy-Oriented Perspective on Algorithmic Bias, Data Privacy, and Digital Determinism in Player Evaluation
by
Elia Morgulev and Ofer H. Azar
Big Data Cogn. Comput. 2026, 10(5), 146; https://doi.org/10.3390/bdcc10050146 - 7 May 2026
Abstract
Big-data analytics are increasingly used in scouting and talent identification, with machine learning (ML) tools applied to evaluate and predict player performance based on match statistics, video tracking, physical and anthropometric tests, psychological assessments, social media data, and qualitative scouting reports. Advances in computer vision, together with the emergence of affordable automated broadcasting and data collection systems, have extended the deployment of ML-driven scouting from professional to youth sport. The use of algorithms in educational, employment, and healthcare settings has been shown to introduce biases and discrimination while wrongly assuming accuracy and objectivity because the decisions are made automatically and quantitatively. In this respect, we briefly describe the development of data-driven performance analysis and how ML-based technologies are currently applied for early screening and comparison of large player populations. Based on a narrative overview of the literature, we draw on evidence from education, employment, and healthcare to identify risks that may also emerge in ML-driven player evaluation, including algorithmic bias, non-representative training data, privacy concerns, and the persistence of model-based labels over time, especially in youth sport. Our main contribution is translating these threats into governance principles and operational safeguards for responsible use of AI in scouting and talent identification.
(This article belongs to the Special Issue AI and Data Science in Sports Analytics)
Open Access Article
AI-Driven Generation of Old English: A Framework for Low-Resource Languages
by
Rodrigo Gabriel Salazar Alva, Matías Núñez, Cristian López Del Alamo and Javier Martín Arista
Big Data Cogn. Comput. 2026, 10(5), 145; https://doi.org/10.3390/bdcc10050145 - 6 May 2026
Abstract
Preserving ancient languages is essential for understanding the cultural and linguistic heritage of humanity. Old English, however, remains critically under-resourced, which limits its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts to address this gap. In this study, we specifically employ state-of-the-art models, including Llama-3.1-8B and Mistral-7B, as our foundation models, which are then adapted to the unique characteristics of Old English. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation (LoRA)), data augmentation via back-translation, and a dual-agent pipeline that separates content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment confirms high grammatical accuracy and stylistic fidelity in the generated texts, with average scores of 9.0/10 for inflection and word order, 9.1/10 for lexical authenticity, and 7.8 for semantic coherence. These results demonstrate that the framework can reliably expand limited historical corpora while maintaining linguistic integrity, with immediate practical applications in digital humanities research, computational philology, and the development of educational resources for Old English study. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, thus linking AI innovation with the goals of cultural preservation.
Open Access Article
HYSARD: A Hybrid Feature-Fusion Model for Sarcasm Detection Using RoBERTa Embeddings and Linguistic Features
by
Ismail Jabri, Zine Eddine Louriga, Aziza El Ouaazizi and Abdelaziz Ahaitouf
Big Data Cogn. Comput. 2026, 10(5), 144; https://doi.org/10.3390/bdcc10050144 - 6 May 2026
Abstract
Sarcasm detection remains a challenging task in natural language processing because sarcastic expressions often convey meanings that contradict their literal wording. Although transformer-based encoders such as RoBERTa capture contextual semantics effectively, sparse linguistic signals common in sarcastic user-generated text, such as exaggerated punctuation, elongated words, capitalization, and sentiment contrast, may not always remain explicitly accessible in the final sentence representation. To address this limitation, we propose HYSARD, a hybrid feature-fusion model that combines RoBERTa-based sentence embeddings with complementary linguistic features, including sentiment polarity, stylistic markers, syntactic patterns, and TF-IDF lexical cues. The resulting feature space is refined through Random Forest-based feature selection to reduce redundancy and improve robustness, while SMOTE mitigates class imbalance during training. We evaluate HYSARD on the SemEval-2022 iSarcasmEval dataset and the balanced Main and Political subsets of SARC 2.0. Results show strong and consistent performance across datasets, with an F1-score of 0.80 on iSarcasmEval, while held-out test-set error analysis further highlights strong class-wise discrimination. The ablation study further confirms that combining contextual embeddings with explicit linguistic cues improves sarcasm detection over reduced feature configurations. These findings show that hybrid feature fusion remains an effective and practical strategy for sarcasm detection in noisy social media text.
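To make the fusion idea concrete, the sketch below computes three of the stylistic cues named in the abstract (exaggerated punctuation, elongated words, capitalization) and appends them to a sentence embedding. It is an illustration under stated assumptions, not the HYSARD implementation: the regular expressions, the feature definitions, and the function names are hypothetical, and the embedding is a placeholder list rather than a real RoBERTa output.

```python
import re
from typing import List, Sequence


def stylistic_features(text: str) -> List[float]:
    """Count three surface cues: repeated !/? runs, elongated words, all-caps words."""
    words = text.split()
    # Runs of two or more '!' or '?' characters, e.g. "!!!" or "?!".
    exaggerated_punct = len(re.findall(r"[!?]{2,}", text))
    # Words containing the same letter three or more times in a row, e.g. "loooove".
    elongated = sum(1 for w in words if re.search(r"([A-Za-z])\1{2,}", w))
    # Fully upper-case alphabetic words of length > 1, ignoring edge punctuation.
    stripped = (w.strip(".,!?;:") for w in words)
    all_caps = sum(1 for w in stripped if len(w) > 1 and w.isalpha() and w.isupper())
    return [float(exaggerated_punct), float(elongated), float(all_caps)]


def fuse(embedding: Sequence[float], text: str) -> List[float]:
    """Concatenate a sentence embedding with the hand-crafted stylistic features."""
    return list(embedding) + stylistic_features(text)


if __name__ == "__main__":
    sentence = "Oh GREAT, I just loooove Mondays!!!"
    placeholder_embedding = [0.1, -0.2]  # stands in for a real encoder output
    print(fuse(placeholder_embedding, sentence))
```

Concatenation keeps the explicit cues as separate dimensions, so a downstream selector or classifier can weigh them directly instead of relying on the encoder to preserve them.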
(This article belongs to the Special Issue Natural Language Processing and Text Analysis in Social Media)
Open Access Article
Evaluating Computational Approaches for Harmful Content Analysis: Promise, Pitfalls and Tools for Responsible Research
by
Itai Himelboim and Mudit Baid
Big Data Cogn. Comput. 2026, 10(5), 143; https://doi.org/10.3390/bdcc10050143 - 2 May 2026
Abstract
This manuscript develops and demonstrates a practical framework for evaluating automated classifiers used in communication research, using harmful language detection as an illustrative case. We combine (a) a structured review of documentation practices for 27 publicly available classifiers and their associated annotation processes with (b) a cross-dataset evaluation that re-tests each model beyond its original training context. Across 27 datasets, we extract and compare reporting on construct definitions, annotator instructions, and inter-annotator agreement, and we quantify generalization by applying each model to multiple out-of-domain test sets. We also benchmark a contemporary large language model (GPT-5) under a consistent prompting protocol to illustrate how LLM-based classification compares to fine-tuned classifiers. Results show that documentation is uneven and often insufficient for theory-driven measurement, inter-annotator agreement varies widely across datasets, and cross-dataset performance frequently drops substantially relative to within-dataset evaluations. Building on these findings and existing validation guidance, we provide a reusable checklist and decision flow to help researchers select, justify, and report classifier-based measures in ways that support transparency and cumulative science. Recommendations for researchers, reviewers, and journal editors stress aligning model selection with standards of validity, reliability, and transparency.
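The abstract does not say which inter-annotator agreement statistic the reviewed datasets report. As one common choice for two annotators, Cohen's kappa corrects raw agreement for the agreement expected by chance; the sketch below is a plain-Python illustration with hypothetical labels, not the authors' procedure.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = [1, 1, 0, 0]  # hypothetical harmful (1) / not harmful (0) labels
    annotator_2 = [1, 0, 0, 0]
    print(cohens_kappa(annotator_1, annotator_2))
```

Two annotators can agree on most items and still score a modest kappa when one label dominates, which is one reason raw agreement percentages are hard to compare across datasets.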
(This article belongs to the Section Data Mining and Machine Learning)
Topics
Topic in
Actuators, Algorithms, BDCC, Future Internet, JMMP, Machines, Robotics, Systems
Smart Product Design and Manufacturing on Industrial Internet
Topic Editors: Pingyu Jiang, Jihong Liu, Ying Liu, Jihong Yan
Deadline: 30 June 2026
Topic in
Sensors, Electronics, Technologies, AI, Entropy, Quantum Reports, BDCC
Responsible Classic/Quantum AI Technologies for Industrial Applications
Topic Editors: Youyang Qu, Khandakar Ahmed, Zhiyi Tian
Deadline: 31 July 2026
Topic in
AI, BDCC, Future Internet, Information, Sustainability
Big Data and Artificial Intelligence, 3rd Edition
Topic Editors: Miltiadis D. Lytras, Andreea Claudia Serban
Deadline: 30 August 2026
Topic in
Computers, Electronics, Future Internet, IoT, Network, Sensors, JSAN, Technologies, BDCC
Challenges and Future Trends of Wireless Networks
Topic Editors: Stefano Scanzio, Ramez Daoud, Jetmir Haxhibeqiri, Pedro Santos
Deadline: 30 September 2026
Special Issues
Special Issue in
BDCC
Machine Learning Applications in Natural Language Processing
Guest Editors: Ying Weng, Kecheng Liu, Chao Li
Deadline: 20 May 2026
Special Issue in
BDCC
Application of Digital Technology in Financial Development
Guest Editors: Wei Li, Michael C. S. Wong
Deadline: 30 May 2026
Special Issue in
BDCC
Human-Centered and Sustainable Artificial Intelligence: Emerging Perspectives in HCI
Guest Editors: Luciano Alessandro Ipsaro Palesi, Damiano Perri, Kouzeleas Stelios
Deadline: 31 May 2026
Special Issue in
BDCC
Internet Intelligence for Cybersecurity
Guest Editor: Hui Tian
Deadline: 12 June 2026



