Big Data and Cognitive Computing

20 pages, 1598 KB

Open Access | Article

Importance of Data Preprocessing for Accurate and Effective Prediction of Breast Cancer: Evaluation of Model Performance in Novel Data

by Vekani Baloyi, Jamolbek Mattiev and Sello Mokwena

Big Data Cogn. Comput. 2025, 9(10), 266; https://doi.org/10.3390/bdcc9100266 - 21 Oct 2025

Cited by 1 | Viewed by 1334

Abstract

Breast cancer is one of the leading causes of mortality among women globally, and an early and accurate diagnosis is essential for effective treatment and improved survival rates. Traditional diagnostic techniques often struggle to differentiate between benign and malignant tumors due to overlapping visual characteristics, resulting in false positives or delayed detection. For efficient breast cancer detection with machine learning, it is vital to identify the most significant features because those features play the most important roles in the treatment process. This study addresses this challenge by evaluating and comparing the performance of ten state-of-the-art machine learning classifiers for breast cancer detection using image-derived features. Initially, 30 features were extracted from a novel tertiary hospital dataset, and models were evaluated based on accuracy, precision, recall, and F-measure. To enhance model performance and reduce dimensionality, the Correlation-based Feature Selection (CFS) method was applied, leading to the identification of 11 highly informative features. Our experimental results demonstrate that, while models such as SVM and Logistic Regression achieved the highest accuracy (97.7%) on the full feature set, the Neural Network exhibited a superior performance (97.2%) on the reduced feature set, with a substantial reduction in training time. Most classifiers maintained comparable or improved accuracy with fewer features, indicating effective dimensionality reduction. Furthermore, pairwise statistical significance testing confirmed that ensemble and kernel-based classifiers achieved a statistically superior performance over simpler models. These findings highlight the importance of effective feature selection in developing accurate, efficient, and scalable breast cancer prediction systems. Full article
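As a rough illustration of how Correlation-based Feature Selection scores a candidate feature subset, the sketch below computes the standard CFS merit heuristic, k * r_cf / sqrt(k + k(k - 1) * r_ff), where r_cf is the mean absolute feature-class correlation and r_ff the mean absolute feature-feature correlation. The function name and the toy correlation values are hypothetical; this is not the authors' implementation, and the search strategy over subsets is omitted.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """CFS merit of a subset: rewards relevance, penalizes redundancy.

    feature_class_corr: one correlation per feature in the subset.
    feature_feature_corr: one correlation per unordered feature pair
    (empty when the subset holds a single feature).
    """
    k = len(feature_class_corr)
    r_cf = sum(abs(r) for r in feature_class_corr) / k
    if feature_feature_corr:
        r_ff = sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
    else:
        r_ff = 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical values: two equally relevant features.
redundant = cfs_merit([0.5, 0.5], [1.0])      # fully redundant pair
independent = cfs_merit([0.5, 0.5], [0.0])    # uncorrelated pair
```

With these toy inputs the uncorrelated pair scores higher than the redundant pair, which is the behaviour that lets CFS shrink a feature set without discarding informative features.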

14 pages, 586 KB

Open Access | Article

Complex Table Question Answering with Multiple Cells Recall Based on Extended Cell Semantic Matching

by Hainan Chen and Dongqi Shen

Big Data Cogn. Comput. 2025, 9(10), 265; https://doi.org/10.3390/bdcc9100265 - 20 Oct 2025

Viewed by 1101

Abstract

Tables, as a form of structured or semi-structured data, are widely found in documents, reports, and data manuals. Table-based question answering (TableQA) plays a key role in table document analysis and understanding. Existing approaches to TableQA can be broadly categorized into content-matching methods and end-to-end generation methods based on encoder–decoder deep neural networks. Content-matching methods return one or more table cells as answers, thereby preserving the original data and making them more suitable for downstream tasks. End-to-end methods, especially those leveraging large language models (LLMs), have achieved strong performance on various benchmarks. However, the variability in LLM-generated expressions and their heavy reliance on prompt engineering limit their applicability where answer fidelity to the source table is critical. In this work, we propose CBCM (Cell-by-Cell semantic Matching), a fine-grained cell-level matching method that extends the traditional row- and column-matching paradigm to improve accuracy and applicability in TableQA. Furthermore, based on the public IM-TQA dataset, we construct a new benchmark, IM-TQA-X, specifically designed for the multi-row and multi-column cell recall task, a scenario underexplored in existing state-of-the-art content-matching methods. Experimental results show that CBCM improves overall accuracy by 2.5% over the latest row- and column-matching method RGCNRCI (Relational Graph Convolutional Networks based Row and Column Intersection), and boosts accuracy in the multi-row and multi-column recall task from 4.3% to 34%. Full article

35 pages, 1642 KB

Open Access | Article

Adopting Generative AI in Higher Education: A Dual-Perspective Study of Students and Lecturers in Saudi Universities

by Doaa M. Bamasoud, Rasheed Mohammad and Sara Bilal

Big Data Cogn. Comput. 2025, 9(10), 264; https://doi.org/10.3390/bdcc9100264 - 18 Oct 2025

Cited by 7 | Viewed by 6116

Abstract

The integration of Generative Artificial Intelligence (GenAI) tools, such as ChatGPT, into higher education has introduced new opportunities and challenges for students and lecturers alike. This study investigates the psychological, ethical, and institutional factors that shape the adoption of GenAI tools in Saudi Arabian universities, drawing on an extended Technology Acceptance Model (TAM) that incorporates constructs from Self-Determination Theory (SDT) and ethical decision-making. A cross-sectional survey was administered to 578 undergraduate students and 309 university lecturers across three major institutions in Southern Saudi Arabia. Quantitative analysis using Structural Equation Modelling (SmartPLS 4) revealed that perceived usefulness, intrinsic motivation, and ethical trust significantly predicted students’ intention to use GenAI. Perceived ease of use influenced intention both directly and indirectly through usefulness, while institutional support positively shaped perceptions of GenAI’s value. Academic integrity and trust-related concerns emerged as key mediators of motivation, highlighting the ethical tensions in AI-assisted learning. Lecturer data revealed a parallel set of concerns, including fear of overreliance, diminished student effort, and erosion of assessment credibility. Although many faculty members had adapted their assessments in response to GenAI, institutional guidance was often perceived as lacking. Overall, the study offers a validated, context-sensitive model for understanding GenAI adoption in education and emphasises the importance of ethical frameworks, motivation-building, and institutional readiness. These findings offer actionable insights for policy-makers, curriculum designers, and academic leaders seeking to responsibly integrate GenAI into teaching and learning environments. Full article

27 pages, 3065 KB

Open Access | Editor’s Choice | Article

Chinese Financial News Analysis for Sentiment and Stock Prediction: A Comparative Framework with Language Models

by Hsiu-Min Chuang, Hsiang-Chih He and Ming-Che Hu

Big Data Cogn. Comput. 2025, 9(10), 263; https://doi.org/10.3390/bdcc9100263 - 16 Oct 2025

Cited by 3 | Viewed by 6789

Abstract

Financial news has a significant impact on investor sentiment and short-term stock price trends. While many studies have applied natural language processing (NLP) techniques to financial forecasting, most have focused on single tasks or English corpora, with limited research in non-English language contexts such as Taiwan. This study develops a joint framework to perform sentiment classification and short-term stock price prediction using Chinese financial news from Taiwan’s top 50 listed companies. Five types of word embeddings (one-hot, TF-IDF, CBOW, skip-gram, and BERT) are systematically compared across 17 traditional, deep, and Transformer models, as well as a large language model (LLaMA3) fully fine-tuned on the Chinese financial texts. To ensure annotation quality, sentiment labels were manually assigned by annotators with finance backgrounds and validated through a double-checking process. Experimental results show that a CNN using skip-gram embeddings achieves the strongest performance among deep learning models, while LLaMA3 yields the highest overall F1-score for sentiment classification. For regression, LSTM consistently provides the most reliable predictive power across different volatility groups, with Bayesian Linear Regression remaining competitive for low-volatility firms. LLaMA3 is the only Transformer-based model to achieve a positive R² under high-volatility conditions. Furthermore, forecasting accuracy is higher for the five-day horizon than for the fifteen-day horizon, underscoring the increasing difficulty of medium-term forecasting. These findings confirm that financial news provides valuable predictive signals for emerging markets and that short-term sentiment-informed forecasts enhance real-time investment decisions. Full article

(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)

25 pages, 1360 KB

Open Access | Article

Source Robust Non-Parametric Reconstruction of Epidemic-like Event-Based Network Diffusion Processes Under Online Data

by Jiajia Xie, Chen Lin, Xinyu Guo and Cassie S. Mitchell

Big Data Cogn. Comput. 2025, 9(10), 262; https://doi.org/10.3390/bdcc9100262 - 16 Oct 2025

Viewed by 1006

Abstract

Temporal network diffusion models play a crucial role in healthcare, information technology, and machine learning, enabling the analysis of dynamic event-based processes such as disease spread, information propagation, and behavioral diffusion. This study addresses the challenge of reconstructing temporal network diffusion events in real time under conditions of missing and evolving data. A novel non-parametric reconstruction method by simple weights differentiation is proposed to enhance source detection robustness with provably improved error bounds. The approach introduces adaptive cost adjustments, dynamically reducing high-risk source penalties and enabling bounded detours to mitigate errors introduced by missing edges. Theoretical analysis establishes enhanced upper bounds on false positives caused by detouring, while a stepwise evaluation of dynamic costs minimizes redundant solutions, resulting in robust Steiner tree reconstructions. Empirical validation on three real-world datasets demonstrates a 5% improvement in Matthews correlation coefficient (MCC), a twofold reduction in redundant sources, and a 50% decrease in source variance. These results confirm the effectiveness of the proposed method in accurately reconstructing temporal network diffusion while improving stability and reliability in both offline and online settings. Full article

(This article belongs to the Special Issue Advances in Graph Learning and Representation Models for Complex Network Analysis)

24 pages, 13667 KB

Open Access | Editor’s Choice | Article

Integrating Graph Retrieval-Augmented Generation into Prescriptive Recommender Systems

by Marvin Niederhaus, Nico Migenda, Julian Weller, Martin Kohlhase and Wolfram Schenck

Big Data Cogn. Comput. 2025, 9(10), 261; https://doi.org/10.3390/bdcc9100261 - 15 Oct 2025

Viewed by 3874

Abstract

Making time-critical decisions with serious consequences is a daily aspect of work environments. To support the process of finding optimal actions, data-driven approaches are increasingly being used. The most advanced form of data-driven analytics is prescriptive analytics, which prescribes actionable recommendations for users. However, the produced recommendations rely on complex models and optimization techniques that are difficult to understand or justify to non-expert users. Currently, there is a lack of platforms that offer easy integration of domain-specific prescriptive analytics workflows into production environments. In particular, there is no centralized environment and standardized approach for implementing such prescriptive workflows. To address these challenges, large language models (LLMs) can be leveraged to improve interpretability by translating complex recommendations into clear, context-specific explanations, enabling non-experts to grasp the rationale behind the suggested actions. Nevertheless, we acknowledge the inherent black-box nature of LLMs, which may introduce limitations in transparency. To mitigate these limitations and to provide interpretable recommendations based on real user knowledge, a knowledge graph is integrated. In this paper, we present and validate a prescriptive analytics platform that integrates ontology-based graph retrieval-augmented generation (GraphRAG) to enhance decision making by delivering actionable and context-aware recommendations. For this purpose, a knowledge graph is created through a fully automated workflow based on an ontology, which serves as the backbone of the prescriptive platform. Data sources for the knowledge graph are standardized and classified according to the ontology by employing a zero-shot classifier. For user-friendly presentation, we critically examine the usability of GraphRAG in prescriptive analytics platforms. We validate our prescriptive platform in a customer clinic with industry experts in our IoT-Factory, a dedicated research environment. Full article

15 pages, 296 KB

Open Access | Article

Cognitive Computing Frameworks for Scalable Deception Detection in Textual Data

by Faiza Belbachir

Big Data Cogn. Comput. 2025, 9(10), 260; https://doi.org/10.3390/bdcc9100260 - 14 Oct 2025

Viewed by 1176

Abstract

Detecting deception in emotionally grounded natural language remains a significant challenge due to the subtlety and context dependence of deceptive intent. In this work, we use a structured behavioral dataset in which participants produce truthful and deceptive statements under emotional and social constraints. To maintain label accuracy and semantic consistency, we propose a multilayer validation pipeline combining self-consistency prompting with feedback-guided revision, implemented through the CoTAM (Chain-of-Thought Assisted Modification) method. Our results demonstrate that this framework enhances deception detection by leveraging a sentence decomposition strategy that highlights subtle emotional and strategic cues, improving interpretability for both models and human annotators. Full article

38 pages, 913 KB

Open Access | Editor’s Choice | Article

Towards the Adoption of Recommender Systems in Online Education: A Framework and Implementation

by Alex Martínez-Martínez, Águeda Gómez-Cambronero, Raul Montoliu and Inmaculada Remolar

Big Data Cogn. Comput. 2025, 9(10), 259; https://doi.org/10.3390/bdcc9100259 - 14 Oct 2025

Cited by 4 | Viewed by 4385

Abstract

The rapid expansion of online education has generated large volumes of learner interaction data, highlighting the need for intelligent systems capable of transforming this information into personalized guidance. Educational Recommender Systems (ERS) represent a key application of big data analytics and machine learning, offering adaptive learning pathways that respond to diverse student needs. For widespread adoption, these systems must align with pedagogical principles while ensuring transparency, interpretability, and seamless integration into Learning Management Systems (LMS). This paper introduces a comprehensive framework and implementation of an ERS designed for platforms such as Moodle. The system integrates big data processing pipelines to support scalability, real-time interaction, and multi-layered personalization, including data collection, preprocessing, recommendation generation, and retrieval. A detailed use case demonstrates its deployment in a real educational environment, underlining both technical feasibility and pedagogical value. Finally, the paper discusses challenges such as data sparsity, learner model complexity, and evaluation of effectiveness, offering directions for future research at the intersection of big data technologies and digital education. By bridging theoretical models with operational platforms, this work contributes to sustainable and data-driven personalization in online learning ecosystems. Full article

36 pages, 2906 KB

Open Access | Editor’s Choice | Review

Data Organisation for Efficient Pattern Retrieval: Indexing, Storage, and Access Structures

by Paraskevas Koukaras and Christos Tjortjis

Big Data Cogn. Comput. 2025, 9(10), 258; https://doi.org/10.3390/bdcc9100258 - 13 Oct 2025

Cited by 3 | Viewed by 3458

Abstract

The increasing scale and complexity of data mining outputs, such as frequent itemsets, association rules, sequences, and subgraphs, have made efficient pattern retrieval a critical yet underexplored challenge. This review addresses the organisation, indexing, and access strategies that enable scalable and responsive retrieval of structured patterns. We examine the underlying types of data and pattern outputs, common retrieval operations, and the variety of query types encountered in practice. Key indexing structures are surveyed, including prefix trees, inverted indices, hash-based approaches, and bitmap-based methods, each suited to different pattern representations and workloads. Storage designs are discussed with attention to metadata annotation, format choices, and redundancy mitigation. Query optimisation strategies are reviewed, emphasising index-aware traversal, caching, and ranking mechanisms. This paper also explores scalability through parallel, distributed, and streaming architectures, and surveys current systems and tools that integrate mining and retrieval capabilities. Finally, we outline pressing challenges and emerging directions, such as supporting real-time and uncertainty-aware retrieval, and enabling semantic, cross-domain pattern access. Additional frontiers include privacy-preserving indexing and secure query execution, along with integration of repositories into machine learning pipelines for hybrid symbolic–statistical workflows. We further highlight the need for dynamic repositories, probabilistic semantics, and community benchmarks to ensure that progress is measurable and reproducible across domains. This review provides a comprehensive foundation for designing next-generation pattern retrieval systems that are scalable, flexible, and tightly integrated into analytic workflows. The analysis and roadmap offered are relevant across application areas including finance, healthcare, cybersecurity, and retail, where robust and interpretable retrieval is essential. Full article

24 pages, 1550 KB

Open Access | Article

Tester-Guided Graph Learning with End-to-End Detection Certificates for Triangle-Based Anomalies

by Manuel J. C. S. Reis

Big Data Cogn. Comput. 2025, 9(10), 257; https://doi.org/10.3390/bdcc9100257 - 12 Oct 2025

Viewed by 1001

Abstract

We investigate anomaly detection in complex networks through a property-testing-guided graph neural model (PT-GNN) that provides an end-to-end miss-probability certificate (δ + α). The method combines (i) a wedge-sampling tester that estimates triangle-closure frequency and derives a concentration bound (δ) via Bernstein’s inequality, with (ii) a lightweight classifier over structural features whose validation error contributes (α). The overall certificate is given by the sum (δ + α), quantifying the probability of missed anomalies under bounded sampling. On synthetic communication graphs with n = 1000, edge probability p = 0.01, and anomalous subgraph size k = 120, PT-GNN achieves perfect detection performance (AUC = 1.0, F1 = 1.0) across all tested regimes. Moreover, the miss-probability certificate tightens systematically as the tester budget m increases (e.g., for ε = 0.06, enlarging m from 2000 to 8000 reduces (δ + α) from ≈0.87 to ≈0.49). These results demonstrate that PT-GNN effectively couples graph learning with property testing, offering both strong empirical detection and formally verifiable guarantees in anomaly detection tasks. Full article
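To make the wedge-sampling idea concrete, here is a minimal sketch, assuming an undirected graph stored as a dict of neighbour sets with at least one wedge. It samples m wedges uniformly (centre chosen in proportion to its number of neighbour pairs) and reports the fraction that close into triangles. The helper `bernstein_delta` is a generic textbook two-sided Bernstein tail for a mean of [0, 1]-valued samples with worst-case variance 1/4; it is an assumption for illustration and is not claimed to reproduce the paper's certificate or its reported values. All function names are hypothetical.

```python
import math
import random

def sample_wedge_closure(adj, m, rng):
    """Estimate the fraction of closed wedges from m uniformly sampled wedges."""
    centers = [v for v in adj if len(adj[v]) >= 2]
    # A node of degree d is the centre of d*(d-1)/2 wedges.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=weights, k=m):
        a, b = rng.sample(sorted(adj[v]), 2)  # two distinct neighbours of v
        if b in adj[a]:                       # the wedge a-v-b closes into a triangle
            closed += 1
    return closed / m

def bernstein_delta(m, eps, var=0.25):
    """Generic Bernstein bound on P(|estimate - mean| >= eps), capped at 1."""
    return min(1.0, 2.0 * math.exp(-m * eps ** 2 / (2.0 * (var + eps / 3.0))))

# Toy graphs: in a 4-clique every wedge is closed; in a star none is.
clique = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
rng = random.Random(0)
clique_rate = sample_wedge_closure(clique, 500, rng)
star_rate = sample_wedge_closure(star, 500, rng)
```

Under this generic bound, a larger budget m gives a smaller failure probability for a fixed ε, which matches the qualitative trend described in the abstract, though the constants here are not the authors'.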

(This article belongs to the Special Issue Advances in Graph Learning and Representation Models for Complex Network Analysis)

40 pages, 2077 KB

Open Access | Article

Robust Clinical Querying with Local LLMs: Lexical Challenges in NL2SQL and Retrieval-Augmented QA on EHRs

by Luka Blašković, Nikola Tanković, Ivan Lorencin and Sandi Baressi Šegota

Big Data Cogn. Comput. 2025, 9(10), 256; https://doi.org/10.3390/bdcc9100256 - 11 Oct 2025

Viewed by 4121

Abstract

Electronic health records (EHRs) are typically stored in relational databases, making them difficult to query for nontechnical users, especially under privacy constraints. We evaluate two practical clinical NLP workflows, natural language to SQL (NL2SQL) for EHR querying and retrieval-augmented generation for clinical question answering (RAG-QA), with a focus on privacy-preserving deployment. We benchmark nine large language models, spanning open-weight options (DeepSeek V3/V3.1, Llama-3.3-70B, Qwen2.5-32B, Mixtral-8x22B, BioMistral-7B, and GPT-OSS-20B) and proprietary APIs (GPT-4o and GPT-5). The models were chosen to represent a diverse cross-section spanning sparse MoE, dense general-purpose, domain-adapted, and proprietary LLMs. On MIMICSQL (27,000 generations; nine models × three runs), the best NL2SQL execution accuracy (EX) is 66.1% (GPT-4o), followed by 64.6% (GPT-5). Among open-weight models, DeepSeek V3.1 reaches 59.8% EX, while DeepSeek V3 reaches 58.8%, with Llama-3.3-70B at 54.5% and BioMistral-7B achieving only 11.8%, underscoring a persistent gap relative to general-domain benchmarks. We introduce SQL-EC, a deterministic SQL error-classification framework with adjudication, revealing string mismatches as the dominant failure (86.3%), followed by query-join misinterpretations (49.7%), while incorrect aggregation-function usage accounts for only 6.7%. This highlights lexical/ontology grounding as the key bottleneck for NL2SQL in the biomedical domain. For RAG-QA, evaluated on 100 synthetic patient records across 20 questions (54,000 reference–generation pairs; three runs), BLEU and ROUGE-L fluctuate more strongly across models, whereas BERTScore remains high on most, with DeepSeek V3.1 and GPT-4o among the top performers; pairwise t-tests confirm that significant differences were observed among the LLMs. Cost–performance analysis based on measured token usage shows per-query costs ranging from USD 0.000285 (GPT-OSS-20B) to USD 0.005918 (GPT-4o); DeepSeek V3.1 offers the best open-weight cost–accuracy trade-off, and GPT-5 provides a balanced API alternative. Overall, the privacy-conscious RAG-QA attains strong semantic fidelity, whereas the clinical NL2SQL remains brittle under lexical variation. SQL-EC pinpoints actionable failure modes, motivating ontology-aware normalization and schema-linked prompting for robust clinical querying. Full article
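Execution accuracy can be illustrated with a small sketch, assuming the common definition that a generated query counts as correct when it runs and returns the same rows as the reference query. The table, column names, and data below are hypothetical and are not the MIMICSQL schema; the example only shows how a string mismatch in a literal produces a wrong result even though the SQL is syntactically valid.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True when the predicted query runs and returns the same rows as the gold query."""
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    # Compare as multisets of rows so that row order does not matter.
    return sorted(pred, key=repr) == sorted(gold, key=repr)

# Hypothetical toy table standing in for a clinical database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnoses (patient_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO diagnoses VALUES (?, ?)",
    [(1, "Hypertension NOS"), (2, "Hypertension NOS"), (3, "Sepsis")],
)

gold = "SELECT COUNT(*) FROM diagnoses WHERE title = 'Hypertension NOS'"
pred_ok = "SELECT COUNT(patient_id) FROM diagnoses WHERE title = 'Hypertension NOS'"
pred_mismatch = "SELECT COUNT(*) FROM diagnoses WHERE title = 'hypertension'"

ok = execution_match(conn, pred_ok, gold)
mismatch = execution_match(conn, pred_mismatch, gold)
```

Here the second prediction uses a lay term rather than the stored string, so it returns a count of zero and fails the match, the kind of lexical failure the abstract reports as dominant.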

(This article belongs to the Special Issue Advances in Large Language Models for Biological and Medical Applications)

33 pages, 845 KB

Open Access | Editor’s Choice | Review

An Overview of AI-Guided Thyroid Ultrasound Image Segmentation and Classification for Nodule Assessment

by Michalis Savelonas

Big Data Cogn. Comput. 2025, 9(10), 255; https://doi.org/10.3390/bdcc9100255 - 10 Oct 2025

Cited by 6 | Viewed by 6223

Abstract

Accurate segmentation and analysis of thyroid nodules in ultrasound (US) images are essential for the diagnosis and management of thyroid conditions, including cancer. Despite advancements in medical imaging, achieving accurate and efficient segmentation remains a significant challenge due to the complexity and variability of US images. Recently, deep learning (DL) techniques, such as convolutional neural networks (CNNs) and vision transformers (ViTs), have emerged as powerful tools for computer-aided diagnosis (CAD). This review highlights recent advancements in thyroid US image segmentation, focusing on state-of-the-art DL techniques such as contrastive learning, consistency learning, and knowledge-driven DL. We explore various approaches to improve segmentation accuracy, including multi-task learning, self-supervised learning, and methods that minimize reliance on the availability of large, annotated datasets. Additionally, we examine the clinical significance of these methods in differentiating between benign and malignant nodules, as well as their potential for integration into clinically adopted, fully automated CAD systems. By addressing the latest developments and ongoing challenges, this review serves as a comprehensive reference for future research and clinical implementation of thyroid US diagnostics. Full article

37 pages, 5762 KB

Open Access | Article

Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices

by Vladimir Kazakovtsev, Mikhail Plekhanov, Alexandr Naumchev, Guzel Shkaberina, Igor Masich, Lyudmila Egorova, Alena Stupina, Aleksey Popov and Lev Kazakovtsev

Big Data Cogn. Comput. 2025, 9(10), 254; https://doi.org/10.3390/bdcc9100254 - 9 Oct 2025

Cited by 1 | Viewed by 6258

Abstract

In this study, we propose a novel adaptive algorithm for approximate nearest neighbor (ANN) search, based on the inverted file (IVF) index (cluster-based index) and online query complexity classification. The concept of the classical IVF search implemented in vector databases is as follows: all data vectors are divided into clusters, and each cluster is represented by its central point (centroid). For an ANN search query, the closest centroids are determined, and the search then continues in the corresponding clusters only. In our study, the complexity of each query is assessed and classified using the results of an initial trial search in a limited number of clusters. Based on this classification, the algorithm dynamically determines the number of clusters presumably sufficient to achieve the desired Recall value, thereby improving vector search efficiency. Our experiments show that such a complexity classifier can be built with the use of a single feature, and we propose an algorithm for its training. We studied the impact of various features on the query processing and discovered a strong dependence on the number of clusters that contain at least one nearest neighbor (productive clusters). The new algorithm is designed to be implemented on top of the IVF search, which is a well-known algorithm for approximate nearest neighbor search, and uses existing IVF indexes that are widely used in the most popular vector database management systems, such as pgvector. The results obtained demonstrate a significant increase in the speed of nearest neighbor search (up to 35%) while maintaining a high Recall rate of 0.99. Additionally, the search algorithm is deterministic, which might be extremely important for tasks where the reproducibility of results plays a crucial role. The developed algorithm has been tested on datasets of varying sizes up to one billion data vectors. Full article
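The classical IVF search described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: the centroids are simply taken from the data rather than trained with k-means, the parameter name `nprobe` (the number of clusters scanned) is borrowed from common vector-search usage, and the adaptive, classifier-driven choice of that number proposed in the paper is not implemented here.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector index to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda j: dist2(v, centroids[j]))
        lists[c].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda j: dist2(query, centroids[j]))
    candidates = [i for j in order[:nprobe] for i in lists[j]]
    candidates.sort(key=lambda i: (dist2(query, vectors[i]), i))
    return candidates[:k]

# Toy data: 200 random 2-D points, with the first 8 used as stand-in centroids.
rng = random.Random(0)
vectors = [(rng.random(), rng.random()) for _ in range(200)]
centroids = vectors[:8]
lists = build_ivf(vectors, centroids)
query = (0.5, 0.5)
approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, vectors, centroids, lists, k=5, nprobe=len(centroids))
```

Scanning every cluster reproduces the exact brute-force answer, while a small `nprobe` trades recall for speed; choosing that value per query is the decision the abstract's complexity classifier is meant to make.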

26 pages, 1895 KB

Open Access | Article

A Pattern-Based Framework for Automated Migration of Monolithic Applications to Microservices

by Hossam Hassan, Manal A. Abdel-Fattah and Wael Mohamed

Big Data Cogn. Comput. 2025, 9(10), 253; https://doi.org/10.3390/bdcc9100253 - 6 Oct 2025

Cited by 3 | Viewed by 2831

Abstract

Over the past decade, many software enterprises have migrated from monolithic to microservice architectures to enhance scalability, maintainability, and performance. However, this transition presents significant challenges, requiring considerable development efforts, research, customization, and resource allocation over extended periods. Furthermore, the success of migration is not guaranteed, highlighting the complexities organizations face in modernizing their software systems. To address these challenges, this study introduces Mono2Micro, a comprehensive framework designed to automate the migration process while preserving structural integrity and optimizing service boundaries. The framework focuses on three core patterns: database patterns, service decomposition, and communication patterns. It leverages machine learning algorithms, including Random Forest and Louvain clustering, to analyze database query patterns along with static and dynamic database model analysis, which enables the identification of relationships between models, facilitating the systematic decomposition of microservices while ensuring efficient inter-service communication. To validate its effectiveness, Mono2Micro was applied to a student information system for faculty management, demonstrating its ability to streamline the migration process while maintaining functional integrity. The proposed framework offers a systematic and scalable solution for organizations and researchers seeking efficient migration from monolithic systems to microservices. Full article

(This article belongs to the Special Issue Advanced Software and Machine Learning Techniques for System Architectures and Big Data)

18 pages, 478 KB

Open Access | Editor’s Choice | Review

A Digital Twin Threat Survey

by Manuel Suárez-Román, Mario Sanz-Rodrigo, Andrés Marín-López and David Arroyo

Big Data Cogn. Comput. 2025, 9(10), 252; https://doi.org/10.3390/bdcc9100252 - 2 Oct 2025

Cited by 2 | Viewed by 3744

Abstract

Virtual and digital twins are means of high value to characterize, model and control physical systems, providing the basis for a simulation environment and lab. In the case of a digital twin, it is possible to have a replica of a physical environment by means of reliable sensor networks and accurate data. In this paper we analyse in detail the threats to the reliability of the information extracted from these sensor networks, along with a set of challenges to guarantee data liveness and trustworthiness. Full article

(This article belongs to the Topic Internet of Things Architectures, Applications, and Strategies: Emerging Paradigms, Technologies, and Advancing AI Integration)

23 pages, 2619 KB

Open Access | Article

Monitoring of First Responders Biomedical Data During Training with Innovative Virtual Reality Technologies

by Lýdie Leová, Martin Molek, Petr Volf, Marek Sokol, Jan Hejda, Zdeněk Hon, Marek Bureš and Patrik Kutilek

Big Data Cogn. Comput. 2025, 9(10), 251; https://doi.org/10.3390/bdcc9100251 - 30 Sep 2025

Viewed by 1996

Abstract

Traditional training methods for first responders are often limited by time, resources, and safety constraints, which reduces their consistency and effectiveness. This study focused on two main issues: whether exposure to virtual reality training scenarios induces measurable physiological changes in heart rate and heart rate variability, and whether these responses differ between police and firefighter contexts. The aim of this study was to explore the integration of virtual reality technologies into responder training and to evaluate how biomedical monitoring can be used to assess training effectiveness. A pilot measurement was conducted with ten participants who completed systematic crime scene investigation scenarios in both domains. Heart activity was continuously recorded using a wearable sensor and analyzed for heart rate and heart rate variability parameters, while cognitive load and task performance were also assessed. The collected data were statistically evaluated using tests of normality and paired comparisons between baseline and virtual reality phases. The results showed a significant increase in heart rate and a decrease in heart rate variability during virtual reality exposure compared to baseline, with higher cognitive load and success rates in police scenarios compared to firefighter scenarios. These findings indicate that virtual reality scenarios can elicit measurable psychophysiological responses and highlight the potential of combining immersive technologies with biomedical monitoring for the development of adaptive and effective training methods for first responders. Full article

21 pages, 4354 KB

Open Access | Article

Exploring the Application and Characteristics of Homomorphic Encryption Based on Pixel Scrambling Algorithm in Image Processing

by Tieyu Zhao

Big Data Cogn. Comput. 2025, 9(10), 250; https://doi.org/10.3390/bdcc9100250 - 30 Sep 2025

Viewed by 1303

Abstract

Homomorphic encryption is well known to researchers, yet its application in image processing remains scarce. The diversity of image processing algorithms makes implementing homomorphic encryption challenging. Current research often uses the CKKS algorithm, but it has core bottlenecks in image encryption, such as the mismatch between image data and the homomorphic operation mechanism, high costs induced by the 2D structure, noise-related damage to visual quality, and poor support for nonlinear operations. Based on image pixel characteristics, this study analyzes homomorphic encryption via pixel scrambling algorithms. Using magic square, Arnold, Henon map, and Hilbert curve transformations as starting points, it reveals their homomorphic properties in image processing. It then explores the homomorphic encryption properties of general pixel scrambling algorithms, offering valuable insights for homomorphic encryption applications in image processing. Full article
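
One way to read the homomorphic property of pixel scrambling is that a position permutation commutes with any pixel-wise operation. The sketch below checks this for a single round of the Arnold cat map on a small square image. It is an illustration under that assumption, not the paper's construction; `arnold_scramble`, `pixelwise`, and `add_mod` are hypothetical helper names, and the pixel values are made up.

```python
def arnold_scramble(img):
    """Apply one round of the Arnold cat map to a square image (list of rows)."""
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            # The pixel at (x, y) moves to ((x + y) mod n, (x + 2y) mod n).
            out[(x + y) % n][(x + 2 * y) % n] = img[x][y]
    return out

def pixelwise(op, a, b):
    """Combine two same-sized images pixel by pixel with a binary operation."""
    return [[op(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def add_mod(p, q):
    """8-bit pixel addition with wrap-around."""
    return (p + q) % 256

a = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Operate on the plain images, then scramble ...
lhs = arnold_scramble(pixelwise(add_mod, a, b))
# ... or scramble first, then operate on the scrambled images.
rhs = pixelwise(add_mod, arnold_scramble(a), arnold_scramble(b))
```

Both orders give the same scrambled result, because the map only relocates pixels and never changes their values.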

(This article belongs to the Topic Applications of Image and Video Processing in Medical Imaging)

33 pages, 20640 KB

Open Access | Article

A Complex Network Science Perspective on Urban Parcel Locker Placement

by Enrico Corradini, Mattia Mandorlini, Filippo Mariani, Paolo Roselli, Samuele Sacchetti and Matteo Spiga

Big Data Cogn. Comput. 2025, 9(10), 249; https://doi.org/10.3390/bdcc9100249 - 30 Sep 2025

Cited by 2 | Viewed by 2323

Abstract

The rapid rise of e-commerce is intensifying pressure on last-mile delivery networks, making the strategic placement of parcel lockers an urgent urban challenge. In this work, we adapt multilayer two-mode Social Network Analysis to the parcel-locker siting problem, modeling city-scale systems as bipartite networks linking spatially resolved demand zones to locker locations using only open-source demographic and geographic data. We introduce two new Social Network Analysis metrics, Dual centrality and Coverage centrality, designed to identify both structurally critical and highly accessible lockers within the network. Applying our framework to Milan, Rome, and Naples, we find that conventional coverage-based strategies successfully maximize immediate service reach, but tend to prioritize redundant hubs. In contrast, Dual centrality reveals a distinct set of lockers whose presence is essential for maintaining overall connectivity and resilience, often acting as hidden bridges between user communities. Comparative analysis with state-of-the-art multi-criteria optimization baselines confirms that our network-centric metrics deliver complementary, and in some cases better, guidance for robust locker placement. Our results show that a network-analytic lens yields actionable guidance for resilient last-mile locker siting. The method is reproducible from open data (potential-access weights) and plug-in compatible with observed assignments. Importantly, the path-based results (Coverage centrality) are adjacency-driven and thus largely insensitive to volumetric weights. Full article

(This article belongs to the Special Issue Advances in Graph Learning and Representation Models for Complex Network Analysis)

16 pages, 2125 KB

Open Access | Article

A Multi-Model Machine Learning Framework for Daily Stock Price Prediction

by Bharatendra Rai and Leili Soltanisehat

Big Data Cogn. Comput. 2025, 9(10), 248; https://doi.org/10.3390/bdcc9100248 - 28 Sep 2025

Cited by 4 | Viewed by 5858

Abstract

Stock price prediction remains a challenging problem due to the inherent volatility and complexity of financial markets. This study proposes a multi-model machine learning framework for one-day-ahead stock price prediction using thirty-six features derived from technical indicators. Empirical analysis is conducted on data from Apple, Tesla, and NVIDIA, employing nine classification algorithms, including support vector machines, random forests, extreme gradient boosting, and logistic regression. Results indicate that momentum-based indicators are the most influential predictors. While support vector machines achieve the highest accuracy for Apple, extreme gradient boosting performs best for NVIDIA and Tesla. In addition, explainable AI techniques are applied to interpret individual model predictions, thereby enhancing transparency and trust in the results. The study contributes to financial analytics research by providing a comparative evaluation of diverse machine learning methods and highlighting key indicators critical for short-term stock price forecasting. Full article
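
As a rough illustration of how a momentum-based feature can be paired with a one-day-ahead classification target, the sketch below computes the standard n-day rate of change and an up/down label for the next day. The abstract does not list the thirty-six features or define the target, so the label definition, the function names, and the prices are all assumptions made for this example.

```python
def rate_of_change(closes, n):
    """n-day rate of change, (P_t - P_{t-n}) / P_{t-n}, for t = n .. len(closes) - 1."""
    return [(closes[t] - closes[t - n]) / closes[t - n] for t in range(n, len(closes))]

def next_day_up(closes):
    """1 if the next close is higher than today's, else 0 (the last day has no label)."""
    return [1 if closes[t + 1] > closes[t] else 0 for t in range(len(closes) - 1)]

closes = [100.0, 102.0, 101.0, 105.0, 104.0]  # made-up closing prices

n = 2
roc = rate_of_change(closes, n)   # values for days 2, 3, 4
labels = next_day_up(closes)      # labels for days 0, 1, 2, 3

# Keep only the days that have both a feature and a label (days 2 and 3 here).
pairs = list(zip(roc[:-1], labels[n:]))
```

Each pair holds the feature observed on day t and the direction of the move from day t to day t + 1, so no future information leaks into the feature.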

(This article belongs to the Topic Electronic Communications, IOT and Big Data, 2nd Volume)

24 pages, 817 KB

Open Access | Article

Leveraging Large Language Models for Sustainable and Inclusive Web Accessibility

by Manuel Andruccioli, Barry Bassi, Giovanni Delnevo and Paola Salomoni

Big Data Cogn. Comput. 2025, 9(10), 247; https://doi.org/10.3390/bdcc9100247 - 26 Sep 2025

Cited by 4 | Viewed by 2528

Abstract

The increasing complexity of modern web applications, which are composed of dynamic and asynchronous components, poses a significant challenge for digital inclusion. Traditional automated tools typically analyze only the static HTML markup generated by frontend and backend frameworks. Recent advances in Large Language Models (LLMs) offer a novel approach to enhance the validation process by directly analyzing the source code. In this paper, we investigate the capacity of LLMs to interpret and reason about dynamically generated content, providing real-time feedback on web accessibility. Our findings show that LLMs can correctly anticipate the presence of accessibility violations in the generated HTML code, going beyond the capabilities of traditional validators by also evaluating possible issues caused by the asynchronous execution of the web application. However, alongside legitimate issues, LLMs also produced a significant number of hallucinated or redundant violations. This study contributes to the broader effort of employing AI to improve the inclusivity and equity of the web. Full article

(This article belongs to the Special Issue Generative AI and Large Language Models)

32 pages, 13081 KB

Open Access | Article

FedIFD: Identifying False Data Injection Attacks in Internet of Vehicles Based on Federated Learning

by Huan Wang, Junying Yang, Jing Sun, Zhe Wang, Qingzheng Liu and Shaoxuan Luo

Big Data Cogn. Comput. 2025, 9(10), 246; https://doi.org/10.3390/bdcc9100246 - 26 Sep 2025

Cited by 2 | Viewed by 1508

Abstract

With the rapid development of intelligent connected vehicle technology, false data injection (FDI) attacks have become a major challenge in the Internet of Vehicles (IoV). While deep learning methods can effectively identify such attacks, the dynamic, distributed architecture of the IoV and limited computing resources hinder both privacy protection and lightweight computation. To address this, we propose FedIFD, a federated learning (FL)-based detection method for false data injection attacks. The lightweight threat detection model utilizes basic safety messages (BSM) for local incremental training, and the Q-FedCG algorithm compresses gradients for global aggregation. Original features are reshaped using a time window. To ensure temporal and spatial consistency, a sliding average strategy aligns samples before spatial feature extraction. A dual-branch architecture enables parallel extraction of spatiotemporal features: a three-layer stacked Bidirectional Long Short-Term Memory (BiLSTM) captures temporal dependencies, and a lightweight Transformer models spatial relationships. A dynamic feature fusion weight matrix calculates attention scores for adaptive feature weighting. Finally, a differentiated pooling strategy is applied to emphasize critical features. Experiments on the VeReMi dataset show that the accuracy reaches 97.8%. Full article
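
The abstract mentions reshaping features with a time window and aligning samples with a sliding average, without giving the details. A minimal sketch of both ideas follows; the window length, the step, the feature columns (speed, heading), and the helper names are assumptions for illustration and may differ from what FedIFD does.

```python
def sliding_windows(rows, window, step=1):
    """Group consecutive feature rows into overlapping windows of fixed length."""
    return [rows[i:i + window] for i in range(0, len(rows) - window + 1, step)]

def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical per-message features extracted from BSMs: (speed, heading).
rows = [(10.0, 0.0), (12.0, 1.0), (11.0, 1.0), (13.0, 2.0), (15.0, 2.0)]

windows = sliding_windows(rows, window=3)                     # 3 windows of 3 rows each
smoothed_speed = moving_average([r[0] for r in rows], window=3)
```

With the same window length, each smoothed value lines up with exactly one window, which is one simple way to keep the temporal input and an averaged input consistent in length.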

(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)

28 pages, 2160 KB

Open Access | Article

DTS-MixNet: Dynamic Spatiotemporal Graph Mixed Network for Anomaly Detection in Multivariate Time Series

by Chengxun Tan, Jiayi Hu, Jian Li, Minmin Miao, Wenjun Hu and Shitong Wang

Big Data Cogn. Comput. 2025, 9(10), 245; https://doi.org/10.3390/bdcc9100245 - 25 Sep 2025

Cited by 2 | Viewed by 1552

Abstract

Anomaly detection in multivariate time series (MTS) remains challenging due to the presence of complex and dynamic spatiotemporal dependencies. To address this, we propose the Dynamic Spatiotemporal Graph Mixed Network (DTS-MixNet), which takes sliding-window data as input to predict the next time series data and determine its state. The model comprises five blocks. The Temporal Graph Structure Learner (TGSL) generates attention-weighted graphs via two types of neighbor relationships and multi-head-attention-based neighbor degrees. Then, the Cross-Temporal Dynamic Encoder (CTDE) aggregates the cross-temporal dependencies from the attention-weighted graphs and encodes them into a proxy multivariate sequence (PMS), which is fed into the proposed Cross-Variable Dynamic Encoder (CVDE). Subsequently, the CVDE captures the spatial relationships among sensors through multiple local spatial graphs and a global spatial graph, and produces a spatial graph sequence (SGS). Finally, the Spatiotemporal Mixer (TSM) mixes the PMS and SGS to build a spatiotemporal mixed sequence (TSMS) for downstream tasks, e.g., classification or prediction. We evaluate on two industrial control datasets and discuss applicability to non-industrial multivariate time series. The experimental results on benchmark datasets show encouraging performance for the proposed DTS-MixNet. Full article

18 pages, 2586 KB

Open Access | Article

A Comparative Study of X Data About the NHS Using Sentiment Analysis

by Saeed Ur Rehman, Obi Oluchi Blessing and Anwar Ali

Big Data Cogn. Comput. 2025, 9(10), 244; https://doi.org/10.3390/bdcc9100244 - 24 Sep 2025

Cited by 1 | Viewed by 1334

Abstract

This study investigates sentiment analysis of X data about the National Health Service (NHS) during a politically charged period, using lexicon-based, machine learning, and deep learning approaches, as well as topic modelling and aspect-based sentiment analysis (ABSA). This study is distinct in its comparative evaluation of sentiment analysis techniques on NHS-related tweets during a politically sensitive period, offering insights into public opinion shaped by political discourse. A dataset of 35,000 tweets was collected and analysed using various techniques, including VADER, TextBlob, Naive Bayes, Support Vector Machines, Logistic Regression, Ensemble Learning, and BERT. Unlike previous studies that focus on structured feedback or general sentiment, this research uniquely explores unstructured public discourse during an election period, capturing real-time political sentiment towards NHS policies. The sentiment distribution from lexicon-based methods indicated that the presence of stop words could affect model performance. While all models achieved high accuracy on the validation dataset, challenges such as class imbalance and limited labelled data impacted performance, with signs of overfitting observed. Topic modelling identified nine topic clusters, with “waiting list,” “service,” and “immigration” carrying negative sentiments. At the same time, words like “thank,” “support,” “care,” and “team” had the most positive sentiments, reflecting public satisfaction in these areas. ABSA identified positive sentiments towards aspects like “useful service”. This study contributes a comparative framework for evaluating sentiment analysis techniques in politically contextualised healthcare discourse, offering insights for policymakers and researchers. The study underscores the importance of data quality in sentiment analysis. Future research should consider incorporating multilingual datasets, extending data collection periods, optimising deep learning models, and employing hybrid approaches to enhance performance. Full article
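
To illustrate why stop words can matter for a lexicon-based score, the toy sketch below averages lexicon weights over the tokens of a text, with and without stop-word removal. The lexicon, the stop-word list, the scoring rule, and the sample text are all invented for this example; they are not taken from VADER, TextBlob, or the study's data.

```python
# Toy lexicon and stop-word list: illustrative values only.
LEXICON = {"thank": 1.0, "support": 1.0, "care": 1.0, "delay": -1.0, "cancelled": -1.0}
STOP_WORDS = {"the", "for", "you", "and", "a", "to", "was", "my"}

def score(text, drop_stop_words=False):
    """Average lexicon weight over tokens; words outside the lexicon count as 0."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0.0) for t in tokens) / len(tokens)

tweet = "thank you for the care and support"
raw = score(tweet)                          # stop words dilute the average
clean = score(tweet, drop_stop_words=True)  # only content words remain
```

Under this length-normalised rule, the same text scores lower when stop words are kept, which is one simple mechanism by which preprocessing choices can shift a sentiment distribution.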

21 pages, 40457 KB

Open Access | Article

Interpretable Emotion Estimation in Indoor Remote Work Environments via Environmental Sensor Data

by Yuma Toriyama, Tsumugi Isogami and Nobuyoshi Komuro

Big Data Cogn. Comput. 2025, 9(10), 243; https://doi.org/10.3390/bdcc9100243 - 23 Sep 2025

Viewed by 1131

Abstract

Indoor environmental factors such as CO₂ concentration, temperature, and humidity can significantly influence individuals’ emotional states and productivity. This study continuously collected environmental data using wireless sensors and emotional data from wearable devices in an office-like remote-work setting. Machine learning models, including Random Forest and Gradient Boosting Decision Tree, were developed and interpreted using SHAP (Shapley Additive Explanations). The proposed models achieved estimation accuracies above 85%. SHAP analysis revealed that CO₂ concentration, temperature, and humidity were influential factors in predicting pleasant or unpleasant states. These findings demonstrate the feasibility of real-time, data-driven emotion estimation and provide insights into the design of indoor environments that foster comfort and mental well-being. Full article

Big Data Cogn. Comput., Volume 9, Issue 10 (October 2025) – 24 articles
