Big Data and Cognitive Computing

56 pages, 3118 KiB

Open Access Article

Semantic Reasoning Using Standard Attention-Based Models: An Application to Chronic Disease Literature

by Yalbi Itzel Balderas-Martínez, José Armando Sánchez-Rojas, Arturo Téllez-Velázquez, Flavio Juárez Martínez, Raúl Cruz-Barbosa, Enrique Guzmán-Ramírez, Iván García-Pacheco and Ignacio Arroyo-Fernández

Big Data Cogn. Comput. 2025, 9(6), 162; https://doi.org/10.3390/bdcc9060162 - 19 Jun 2025

Viewed by 797

Abstract

Large-language-model (LLM) APIs demonstrate impressive reasoning capabilities, but their size, cost, and closed weights limit the deployment of knowledge-aware AI within biomedical research groups. At the other extreme, standard attention-based neural language models (SANLMs), including encoder–decoder architectures such as Transformers, Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks, are computationally inexpensive. However, their capacity for semantic reasoning in noisy, open-vocabulary knowledge bases (KBs) remains unquantified. Therefore, we investigate whether compact SANLMs can (i) reason over hybrid OpenIE-derived KBs that integrate commonsense, general-purpose, and non-communicable-disease (NCD) literature; (ii) operate effectively on commodity GPUs; and (iii) exhibit semantic coherence as assessed through manual linguistic inspection. To this end, we constructed four training KBs by integrating ConceptNet (600k triples), a 39k-triple general-purpose OpenIE set, and an 18.6k-triple OpenNCDKB extracted from 1200 PubMed abstracts. Encoder–decoder GRU, LSTM, and Transformer models (1–2 blocks) were trained to predict the object phrase given the subject + predicate. Beyond token-level cross-entropy, we introduced the Meaning-based Selectional-Preference Test (MSPT): for each withheld triple, we masked the object, generated a candidate, and measured its surplus cosine similarity over a random baseline using word embeddings, with significance assessed via a one-sided t-test. Hyperparameter sensitivity (311 GRU/168 LSTM runs) was analyzed, and qualitative frame–role diagnostics completed the evaluation. Our results showed that all SANLMs learned effectively from the point of view of the cross entropy loss. In addition, our MSPT provided meaningful semantic insights. For the GRUs (256-dim, 2048-unit, 1-layer), the mean similarity ($\mu_{sts}$) was 0.641 to the ground truth vs. 0.542 to the random baseline (gap 12.1%; $p < 10^{-180}$). For the 1-block Transformer, $\mu_{sts} = 0.551$ vs. 0.511 (gap 4%; $p < 10^{-25}$). While Transformers minimized loss and accuracy variance, GRUs captured finer selectional preferences. Both architectures trained within <24 GB GPU VRAM and produced linguistically acceptable, albeit over-generalized, biomedical assertions. Due to their observed performance, LSTM results were designated as baseline models for comparison. Therefore, properly tuned SANLMs can achieve statistically robust semantic reasoning over noisy, domain-specific KBs without reliance on massive LLMs. Their interpretability, minimal hardware footprint, and open weights promote equitable AI research, opening new avenues for automated NCD knowledge synthesis, surveillance, and decision support.
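
The MSPT scoring step summarized in this abstract (similarity to the ground truth minus similarity to a random baseline, followed by a one-sided test) can be illustrated with a minimal sketch. The function names, the toy two-dimensional embeddings, the paired form of the test, and the normal approximation to the p-value are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mspt(generated, truth, random_objs):
    """Paired one-sided test: is sim(generated, truth) > sim(generated, random)?"""
    gaps = [cosine(g, t) - cosine(g, r)
            for g, t, r in zip(generated, truth, random_objs)]
    mean_gap = statistics.mean(gaps)
    t_stat = mean_gap / (statistics.stdev(gaps) / math.sqrt(len(gaps)))
    # Assumption: normal approximation to the one-sided p-value,
    # reasonable only for large samples.
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2))
    return mean_gap, t_stat, p_value


# Hypothetical toy embeddings: generated objects lie close to the truth.
generated = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [1.0, 0.0]]
truth = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.9, 0.1]]
random_objs = [[0.0, 1.0], [0.2, 0.9], [0.5, 0.5], [0.1, 1.0]]

mean_gap, t_stat, p_value = mspt(generated, truth, random_objs)
print(f"mean gap={mean_gap:.3f}, t={t_stat:.2f}, p~{p_value:.4f}")
```

A positive mean gap with a small p-value indicates that generated objects are closer to the withheld objects than to random ones, which is the kind of evidence the reported gaps summarize.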

29 pages, 3879 KiB

Open Access Article

Fusion of Sentiment and Market Signals for Bitcoin Forecasting: A SentiStack Network Based on a Stacking LSTM Architecture

by Zhizhou Zhang, Changle Jiang and Meiqi Lu

Big Data Cogn. Comput. 2025, 9(6), 161; https://doi.org/10.3390/bdcc9060161 - 19 Jun 2025

Cited by 2 | Viewed by 2170

Abstract

This paper proposes a comprehensive deep-learning framework, SentiStack, for Bitcoin price forecasting and trading strategy evaluation by integrating multimodal data sources, including market indicators, macroeconomic variables, and sentiment information extracted from financial news and social media. The model architecture is based on a Stacking-LSTM ensemble, which captures complex temporal dependencies and non-linear patterns in high-dimensional financial time series. To enhance predictive power, sentiment embeddings derived from full-text analysis using the DeepSeek language model are fused with traditional numerical features through early and late data fusion techniques. Empirical results demonstrate that the proposed model significantly outperforms baseline strategies, including Buy & Hold and Random Trading, in cumulative return and risk-adjusted performance. Feature ablation experiments further reveal the critical role of sentiment and macroeconomic inputs in improving forecasting accuracy. The sentiment-enhanced model also exhibits strong performance in identifying high-return market movements, suggesting its practical value for data-driven investment decision-making. Overall, this study highlights the importance of incorporating soft information, such as investor sentiment, alongside traditional quantitative features in financial forecasting models.

23 pages, 3558 KiB

Open Access Article

Research on High-Reliability Energy-Aware Scheduling Strategy for Heterogeneous Distributed Systems

by Ziyu Chen, Jing Wu, Lin Cheng and Tao Tao

Big Data Cogn. Comput. 2025, 9(6), 160; https://doi.org/10.3390/bdcc9060160 - 17 Jun 2025

Viewed by 559

Abstract

With the demand for workflow processing driven by edge computing in the Internet of Things (IoT) and cloud computing growing at an exponential rate, task scheduling in heterogeneous distributed systems has become a key challenge in meeting real-time constraints in resource-constrained environments. Existing studies attempt to balance time constraints, energy efficiency, and system reliability in Dynamic Voltage and Frequency Scaling (DVFS) environments. This study proposes a two-stage collaborative optimization strategy that addresses these multi-objective optimization challenges through algorithm design and theoretical analysis. First, based on a reliability-constrained model, we propose a topology-aware dynamic priority scheduling algorithm (EAWRS). This algorithm constructs a node priority function by incorporating in-degree/out-degree weighting factors and critical path analysis to enable multi-objective optimization. Second, to address the time-varying reliability characteristics introduced by DVFS, we propose a Fibonacci search-based dynamic frequency scaling algorithm (SEFFA). This algorithm effectively reduces energy consumption while ensuring task reliability, achieving sub-optimal processor energy adjustment. Together, EAWRS and SEFFA address the dynamic scheduling of directed acyclic graph (DAG) workflows in heterogeneous multi-core processor systems in IoT environments. Experimental evaluations conducted at various scales show that, compared with three state-of-the-art scheduling algorithms, the proposed strategy reduces energy consumption by an average of 14.56% (up to 58.44% under high-reliability constraints) and shortens the makespan by 2.58–56.44% while strictly meeting reliability requirements.
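
To make the idea of a Fibonacci search over processor frequency concrete, the following is a minimal sketch. It is not SEFFA itself: the energy model (static power plus a cubic dynamic term, multiplied by execution time), the parameter values, and the function names are assumptions, and the reliability constraint mentioned in the abstract is omitted.

```python
def fibonacci_search(func, lo, hi, n_steps=30):
    """Minimize a unimodal function on [lo, hi] with a Fibonacci search."""
    fib = [1, 1]
    while len(fib) <= n_steps:
        fib.append(fib[-1] + fib[-2])
    # Two interior probe points placed at Fibonacci ratios of the interval.
    x1 = lo + fib[n_steps - 2] / fib[n_steps] * (hi - lo)
    x2 = lo + fib[n_steps - 1] / fib[n_steps] * (hi - lo)
    f1, f2 = func(x1), func(x2)
    for k in range(n_steps, 2, -1):
        if f1 > f2:  # minimum lies in [x1, hi]; reuse x2 as the new x1
            lo, x1, f1 = x1, x2, f2
            x2 = lo + fib[k - 2] / fib[k - 1] * (hi - lo)
            f2 = func(x2)
        else:  # minimum lies in [lo, x2]; reuse x1 as the new x2
            hi, x2, f2 = x2, x1, f1
            x1 = lo + fib[k - 3] / fib[k - 1] * (hi - lo)
            f1 = func(x1)
    return (lo + hi) / 2


def energy(freq, workload=1.0, p_static=0.5, c_eff=1.0):
    """Assumed energy model: (static + dynamic power) * execution time."""
    exec_time = workload / freq
    power = p_static + c_eff * freq ** 3
    return power * exec_time


# Search a hypothetical normalized frequency range [0.2, 2.0].
f_opt = fibonacci_search(energy, 0.2, 2.0)
print(f"energy-minimal frequency ~ {f_opt:.4f}")
```

Under this assumed model the energy is 0.5/f + f², so the analytic minimizer is the cube root of 0.25, which gives a simple check on the search. Each iteration reuses one of the two previous probe points, so only one new energy evaluation is needed per step.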

19 pages, 4606 KiB

Open Access Article

Time Series Prediction Method of Clean Coal Ash Content in Dense Medium Separation Based on the Improved EMD-LSTM Model

by Kai Cheng, Xiaokang Zhang, Keping Zhou, Chenao Zhou, Jielin Li, Chun Yang, Yurong Guo and Ranfeng Wang

Big Data Cogn. Comput. 2025, 9(6), 159; https://doi.org/10.3390/bdcc9060159 - 15 Jun 2025

Viewed by 564

Abstract

Real-time ash content control in dense medium coal separation is challenged by time lags between detection and density adjustment, along with nonlinear and noisy signals. This study proposes a hybrid prediction model for clean coal ash content in dense medium separation that integrates empirical mode decomposition (EMD), long short-term memory (LSTM) networks, and sparrow search algorithm (SSA) optimization. A key innovation lies in removing noise-containing intrinsic mode functions (IMFs) via EMD to ensure clean signal input to the LSTM model. Using production data from a Shanxi coal plant, EMD decomposes the ash content time series into IMFs and residuals. High-frequency noise-containing IMFs are selectively removed, while the LSTM predicts the retained components. SSA optimizes the LSTM parameters (learning rate, hidden layers, epochs) to minimize prediction errors. Results demonstrate that the EMD-IMF1-LSTM-SSA model achieves superior accuracy (RMSE: 0.0099, MAE: 0.0052, MAPE: 0.047%) and trend consistency (NSD: 12), outperforming baseline models. The study also proposes the novel “Vector Value of the Radial Difference (VVRD)” metric, which effectively quantifies prediction trend accuracy. By resolving time-lag issues and mitigating noise interference, the model enables precise ash content prediction 16 min ahead, supporting automated density control, reduced energy waste, and eco-friendly coal processing. This research provides practical tools and new metrics for intelligent coal separation in the context of green mining.
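
The three error metrics quoted in this abstract (RMSE, MAE, MAPE) follow their standard definitions, sketched below. The ash content values are hypothetical and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires nonzero targets)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ash content readings (%) and model predictions.
ash_true = [10.0, 10.5, 11.0, 10.8]
ash_pred = [10.1, 10.4, 11.0, 10.6]

print(f"RMSE={rmse(ash_true, ash_pred):.4f}")
print(f"MAE={mae(ash_true, ash_pred):.4f}")
print(f"MAPE={mape(ash_true, ash_pred):.3f}%")
```

RMSE penalizes large deviations more strongly than MAE, while MAPE expresses the error relative to the measured value, which is why the three are usually reported together.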

(This article belongs to the Special Issue Application of Deep Neural Networks)

18 pages, 19551 KiB

Open Access Article

FAD-Net: Automated Framework for Steel Surface Defect Detection in Urban Infrastructure Health Monitoring

by Nian Wang, Yue Chen, Weiang Li, Liyang Zhang and Jinghong Tian

Big Data Cogn. Comput. 2025, 9(6), 158; https://doi.org/10.3390/bdcc9060158 - 13 Jun 2025

Viewed by 627

Abstract

Steel plays a fundamental role in modern smart city development, where its surface structural integrity is decisive for operational safety and long-term sustainability. While deep learning approaches show promise, their effectiveness remains limited by inadequate receptive field adaptability, suboptimal feature fusion strategies, and insufficient sensitivity to small defects. To overcome these limitations, we propose FAD-Net, a deep learning framework specifically designed for surface defect detection in steel materials within urban infrastructure. The network incorporates three key innovations: (i) the RFCAConv module, which leverages dynamic receptive field construction and coordinate attention mechanisms to enhance feature representation for defects with long-range spatial dependencies and low-contrast characteristics; (ii) the MSDFConv module, which employs multi-scale dilated convolutions with optimized dilation rates to preserve fine details while expanding the receptive field; and (iii) an Auxiliary Head that introduces hierarchical supervision to improve the detection of small-scale defects. Experiments on the GC10-DET dataset showed that FAD-Net achieved 5.0% higher mAP@0.5 than baseline models. Cross-dataset validation with NEU and RDD2022 further confirmed its robustness. These results demonstrate FAD-Net’s effectiveness for automated infrastructure health monitoring.

(This article belongs to the Special Issue Evolutionary Computation and Artificial Intelligence: Building a Sustainable Future for Smart Cities)

10 pages, 653 KiB

Open Access Article

Analysis of Shots Trajectory and Effectiveness in Women’s and Men’s Football European Championship Matches

by Blanca De-la-Cruz-Torres, Miguel Navarro-Castro and Anselmo Ruiz-de-Alarcón-Quintero

Big Data Cogn. Comput. 2025, 9(6), 157; https://doi.org/10.3390/bdcc9060157 - 12 Jun 2025

Cited by 1 | Viewed by 725

Abstract

Shots on target are a crucial factor in football performance, yet the impact of categorizing shots as low or ground-level and high or parabolic has not been fully explored. The objective of this study was to analyze whether there are differences in the frequency and effectiveness (as measured by xGOT) between parabolic and low shots on target in international men’s and women’s football competitions. The results revealed that the most common shot type was the parabolic shot, occurring in 59.86% of shots on goal in the men’s competition (270 shots) and 67.12% in the women’s competition (196 shots). In the overall set of shots, 62.77% were parabolic (466 shots). No significant differences were observed between the competitions (p > 0.05). Regarding the xGOT values, no significant differences were observed for any of the interaction effects analyzed (gender, shot type and shot outcome). The conclusion was that the parabolic shot was the most frequent type of shot on target in both men’s and women’s football.

(This article belongs to the Special Issue AI and Data Science in Sports Analytics)

20 pages, 966 KiB

Open Access Article

An Empirical Study of Proposer–Builder Separation (PBS) Effects on the Ethereum Ecosystem

by Liyi Zeng, Zihao Zhang, Wei Xu and Zhaoquan Gu

Big Data Cogn. Comput. 2025, 9(6), 156; https://doi.org/10.3390/bdcc9060156 - 12 Jun 2025

Viewed by 947

Abstract

Decentralized blockchains have grown into massive, Internet-scale ecosystems, collectively securing hundreds of billions of dollars in value. The complex interplay of technology and economic incentives within blockchain systems creates a delicate balance that is susceptible to significant shifts even from minor changes. This paper underscores the importance of conducting thorough, data-driven studies to monitor and understand the impacts of significant shifts in blockchain systems, particularly focusing on Ethereum’s groundbreaking proposer–builder separation (PBS) as a pivotal innovation reshaping the ecosystem. PBS revolutionizes Ethereum’s block production, entrusting builders with block construction and proposers with validation via blockchain consensus, with significant impacts on Ethereum decentralization, fairness, and security. Our empirical study reveals key insights, including the following: (a) A substantial 261% increase in proposer revenue underscores the effectiveness of PBS in promoting widespread adoption, significantly enhancing block rewards and proposer incomes. (b) The small profits garnered by builders, comprising only a 3.5% share of block rewards, raise concerns that the security assumptions based on builder reputation may introduce new threats to the system. (c) PBS promotes a more equitable distribution of resources among network participants by reducing proposer centralization and preventing centralization trends among builders and relays, thereby significantly enhancing fairness and decentralization in the Ethereum ecosystem. This study provides a comprehensive analysis of the dynamics of Ethereum PBS adoption, exploring its effects on revenue redistribution among various participants and highlighting its implications for the Ethereum ecosystem’s decentralization.

(This article belongs to the Special Issue Blockchain and Cloud Computing in Big Data and Generative AI Era)

25 pages, 2941 KiB

Open Access Article

Machine Learning-Based Analysis of Travel Mode Preferences: Neural and Boosting Model Comparison Using Stated Preference Data from Thailand’s Emerging High-Speed Rail Network

by Chinnakrit Banyong, Natthaporn Hantanong, Supanida Nanthawong, Chamroeun Se, Panuwat Wisutwattanasak, Thanapong Champahom, Vatanavongs Ratanavaraha and Sajjakaj Jomnonkwao

Big Data Cogn. Comput. 2025, 9(6), 155; https://doi.org/10.3390/bdcc9060155 - 10 Jun 2025

Viewed by 941

Abstract

This study examines travel mode choice behavior within the context of Thailand’s emerging high-speed rail (HSR) development. It conducts a comparative assessment of predictive capabilities between the conventional Multinomial Logit (MNL) framework and advanced data-driven methodologies, including gradient boosting algorithms (Extreme Gradient Boosting, Light Gradient Boosting Machine, Categorical Boosting) and neural network architectures (Deep Neural Network, Convolutional Neural Network). The analysis leverages stated preference (SP) data and employs Bayesian optimization in conjunction with a stratified 10-fold cross-validation scheme to ensure model robustness. CatBoost emerges as the top-performing model (area under the curve = 0.9113; accuracy = 0.7557), highlighting travel cost, service frequency, and waiting time as the most influential determinants. These findings underscore the effectiveness of machine learning approaches in capturing complex behavioral patterns, providing empirical evidence to guide high-speed rail policy development in low- and middle-income countries. Practical implications include optimizing fare structures, enhancing service quality, and improving station accessibility to support sustainable adoption.
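
The Multinomial Logit baseline named in this abstract assigns each mode a choice probability proportional to the exponential of its systematic utility. A minimal sketch follows; the coefficients, mode names, and attribute values are hypothetical and are not estimates from the study.

```python
import math


def mnl_probabilities(utilities):
    """MNL choice probabilities: P_i = exp(V_i) / sum_j exp(V_j)."""
    v_max = max(utilities.values())  # subtract the maximum for numerical stability
    exp_v = {mode: math.exp(v - v_max) for mode, v in utilities.items()}
    total = sum(exp_v.values())
    return {mode: e / total for mode, e in exp_v.items()}


# Hypothetical taste coefficients for cost, waiting time, and service frequency.
beta = {"cost": -0.004, "wait": -0.03, "freq": 0.05}

# Hypothetical attribute levels for three travel modes.
modes = {
    "hsr": {"cost": 900, "wait": 20, "freq": 12},
    "bus": {"cost": 400, "wait": 30, "freq": 20},
    "car": {"cost": 700, "wait": 0, "freq": 0},
}

# Linear-in-parameters systematic utility V_i for each mode.
utilities = {mode: sum(beta[k] * attrs[k] for k in beta)
             for mode, attrs in modes.items()}
probs = mnl_probabilities(utilities)
for mode, p in probs.items():
    print(f"{mode}: {p:.3f}")
```

The linear utility is what makes MNL interpretable but also what limits it, which is the usual motivation for comparing it against boosting and neural models as the abstract describes.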

(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

18 pages, 7506 KiB

Open Access Article

Image Visual Quality: Sharpness Evaluation in the Logarithmic Image Processing Framework

by Arnaud Pauwelyn, Maxime Carré, Michel Jourlin, Dominique Ginhac and Fabrice Meriaudeau

Big Data Cogn. Comput. 2025, 9(6), 154; https://doi.org/10.3390/bdcc9060154 - 9 Jun 2025

Viewed by 515

Abstract

In image processing, the acquisition step plays a fundamental role because it determines image quality. The present paper focuses on the issue of blur and suggests ways of assessing contrast. The approach consists of evaluating the sharpness of an image by means of objective measures based on mathematical, physical, and optical justifications in connection with the human visual system. This is why the Logarithmic Image Processing (LIP) framework was chosen. The sharpness of an image is usually assessed near objects’ boundaries, which encourages the use of gradients, with some major drawbacks. Within the LIP framework, it is possible to overcome such problems using a “contour detector” tool based on the notion of Logarithmic Additive Contrast (LAC). Considering a sequence of increasingly blurred images, we show that the use of LAC enables images to be re-classified in accordance with their defocus level, demonstrating the relevance of the method. The proposed algorithm has been shown to outperform five conventional methods for assessing image sharpness. Moreover, it is the only method that is insensitive to brightness variations. Finally, various application examples are presented, like automatic autofocus control or the comparison of two blur removal algorithms applied to the same image, which particularly concerns the field of Super Resolution (SR) algorithms. Such algorithms multiply (×2, ×3, ×4) the resolution of an image using powerful tools (deep learning, neural networks) while correcting the potential defects (blur, noise) that could be generated by the resolution extension itself. We conclude with the prospects for this work, which should be part of a broader approach to estimating image quality, including sharpness and perceived contrast.
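
For readers unfamiliar with the LIP framework, the sketch below shows the standard LIP addition and subtraction laws and the Logarithmic Additive Contrast between two gray tones, defined as the LIP difference between the larger and the smaller value. It assumes gray tones in [0, M) with M = 256 for 8-bit images; the function names are illustrative, and the sharpness algorithm of the paper itself is not reproduced here.

```python
M = 256.0  # assumed upper bound of the gray-tone range (8-bit images)


def lip_add(f, g):
    """LIP addition: f (+) g = f + g - f*g/M."""
    return f + g - f * g / M


def lip_sub(f, g):
    """LIP subtraction: f (-) g = M*(f - g)/(M - g), for g < M."""
    return M * (f - g) / (M - g)


def lac(f, g):
    """Logarithmic Additive Contrast between two gray tones."""
    hi, lo = max(f, g), min(f, g)
    return lip_sub(hi, lo)


f, g = 200.0, 50.0
contrast = lac(f, g)
print(f"LAC({f:.0f}, {g:.0f}) = {contrast:.2f}")

# Defining property: LIP-adding the contrast to the smaller tone gives the larger.
print(f"50 (+) LAC = {lip_add(g, contrast):.2f}")

# LIP-adding the same constant to both tones leaves the contrast unchanged.
c = 40.0
shifted = lac(lip_add(f, c), lip_add(g, c))
print(f"LAC after shift = {shifted:.2f}")
```

The last check is one plausible reading of the brightness insensitivity mentioned in the abstract: a brightness change modeled as LIP addition of a constant cancels out of the contrast.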

19 pages, 2755 KiB

Open Access, Editor’s Choice, Article

Real-Time Algal Monitoring Using Novel Machine Learning Approaches

by Seyit Uguz, Yavuz Selim Sahin, Pradeep Kumar, Xufei Yang and Gary Anderson

Big Data Cogn. Comput. 2025, 9(6), 153; https://doi.org/10.3390/bdcc9060153 - 9 Jun 2025

Cited by 2 | Viewed by 958

Abstract

Monitoring algal growth rates and estimating microalgae concentration in photobioreactor systems are critical for optimizing production efficiency. Traditional methods (such as microscopy, fluorescence, flow cytometry, spectroscopy, and macroscopic approaches), while accurate, are often costly, time-consuming, labor-intensive, and susceptible to contamination or production interference. To overcome these limitations, this study proposes an automated, real-time, and cost-effective solution by integrating machine learning with image-based analysis. We evaluated the performance of Decision Trees (DTS), Random Forests (RF), Gradient Boosting Machines (GBM), and K-Nearest Neighbors (k-NN) algorithms using RGB color histograms extracted from images of Scenedesmus dimorphus cultures. Ground truth data were obtained via manual cell enumeration under a microscope and dry biomass measurements. Among the models tested, DTS achieved the highest accuracy for cell count prediction (R² = 0.77), while RF demonstrated superior performance for dry biomass estimation (R² = 0.66). Compared to conventional methods, the proposed ML-based approach offers a low-cost, non-invasive, and scalable alternative that significantly reduces manual effort and response time. These findings highlight the potential of machine learning–driven imaging systems for continuous, real-time monitoring in industrial-scale microalgae cultivation.
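
As an illustration of the kind of input features and score this abstract refers to, the sketch below builds a per-channel RGB histogram feature vector and computes the coefficient of determination R². The bin count, pixel values, and function names are assumptions; the study's actual preprocessing is not described in the abstract.

```python
def rgb_histogram(pixels, bins=4):
    """Concatenated per-channel histograms of 8-bit RGB pixels (each sums to 1)."""
    width = 256 // bins
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + min(value // width, bins - 1)] += 1
    return [count / len(pixels) for count in hist]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical greenish pixels sampled from a culture image.
pixels = [(20, 180, 40), (30, 200, 50), (25, 90, 45)]
features = rgb_histogram(pixels)
print(features)

# Hypothetical measured vs. predicted cell counts (arbitrary units).
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(f"R^2 = {score:.3f}")
```

A feature vector of this form can be passed to any of the regressors named in the abstract; coarser or finer bins trade robustness against color resolution.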

16 pages, 2702 KiB

Open Access Article

Real-Time Image Semantic Segmentation Based on Improved DeepLabv3+ Network

by Peibo Li, Jiangwu Zhou and Xiaohua Xu

Big Data Cogn. Comput. 2025, 9(6), 152; https://doi.org/10.3390/bdcc9060152 - 6 Jun 2025

Viewed by 1175

Abstract

To achieve a better balance between accuracy and real-time performance in image semantic segmentation, this paper proposes a real-time image semantic segmentation model based on an improved DeepLabv3+ network. First, the MobileNetV2 model, which has lower computational overhead and fewer parameters, is selected as the backbone network to improve segmentation speed. Then, the Feature Enhancement Module (FEM) is applied to several shallow features of different scales in MobileNetV2, and these shallow features are fused to improve the encoder’s use of edge information, retain more detail, and strengthen the network’s feature representation for complex scenes. Finally, to address the problem that the output feature maps of the Atrous Spatial Pyramid Pooling (ASPP) module do not pay enough attention to detailed information after merging, the FEM attention mechanism is applied to the feature maps processed by the ASPP module. The proposed algorithm achieves a mean intersection over union (mIoU) of 76.45% at 29.18 FPS on the PASCAL VOC2012 Augmented dataset, and an mIoU of 37.31% at 23.31 FPS on the ADE20K dataset. The experimental results show that the algorithm achieves a good balance between accuracy and real-time performance, and that its image semantic segmentation performance is significantly improved compared to DeepLabv3+ and other existing algorithms.
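
The mIoU figure reported in this abstract is conventionally computed from a pixel-level confusion matrix, as sketched below. The two-class matrix is hypothetical, and skipping classes that never occur is one common convention, assumed here.

```python
def mean_iou(confusion):
    """Mean IoU from a confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    n_classes = len(confusion)
    ious = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Hypothetical two-class confusion matrix (pixel counts).
confusion = [
    [50, 10],
    [5, 35],
]
miou = mean_iou(confusion)
print(f"mIoU = {miou:.4f}")
```

For this matrix the per-class IoUs are 50/65 and 35/50, so the mean is about 0.7346. Because every class contributes equally, small or rare classes weigh as much as large ones in the final score.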

32 pages, 2079 KiB

Open Access Review

The Use of Large Language Models in Ophthalmology: A Scoping Review on Current Use-Cases and Considerations for Future Works in This Field

by Ye King Clarence See, Khai Shin Alva Lim, Wei Yung Au, Si Yin Charlene Chia, Xiuyi Fan and Zhenghao Kelvin Li

Big Data Cogn. Comput. 2025, 9(6), 151; https://doi.org/10.3390/bdcc9060151 - 6 Jun 2025

Viewed by 869

Abstract

The advancement of generative artificial intelligence (AI) has resulted in its use permeating many areas of life. Amidst this eruption of scientific output, a wide range of research regarding the usage of Large Language Models (LLMs) in ophthalmology has emerged. In this study, we aim to map out the landscape of LLM applications in ophthalmology, and by consolidating the work carried out, we aim to produce a point of reference to guide the conduct of future works. Eight databases were searched for articles from 2019 to 2024. In total, 976 studies were screened, and a final 49 were included. The study designs and outcomes of these studies were analysed. The performance of LLMs was further analysed in the areas of exam taking and patient education, diagnostic capability, management capability, administration, inaccuracies, and harm. LLMs performed acceptably in most studies, even surpassing humans in some. Despite their relatively good performance, issues pertaining to study design, grading protocols, hallucinations, inaccuracies, and harm were found to be pervasive. LLMs have received considerable attention through their introduction to the public and have found potential applications in the field of medicine, and in particular, ophthalmology. This review therefore recommends using standardised evaluation frameworks and addressing gaps in the current literature when applying LLMs in ophthalmology.

28 pages, 2486 KiB

Open Access, Editor’s Choice, Article

A Framework for Rapidly Prototyping Data Mining Pipelines

by Flavio Corradini, Luca Mozzoni, Marco Piangerelli, Barbara Re and Lorenzo Rossi

Big Data Cogn. Comput. 2025, 9(6), 150; https://doi.org/10.3390/bdcc9060150 - 5 Jun 2025

Viewed by 865

Abstract

With the advent of Big Data, data mining techniques have become crucial for improving decision-making across diverse sectors, yet their employment demands significant resources and time. Time is critical in industrial contexts, as delays can lead to increased costs, missed opportunities, and reduced competitive advantage. To address this, systems for analyzing data can help prototype data mining pipelines, mitigating the risks of failure and resource wastage, especially when experimenting with novel techniques. Moreover, business experts often lack deep technical expertise and need robust support to validate their pipeline designs quickly. This paper presents Rainfall, a novel framework for rapidly prototyping data mining pipelines, developed through collaborative projects with industry. The framework’s requirements stem from a combination of literature review findings, iterative industry engagement, and analysis of existing tools. Rainfall enables the visual programming, execution, monitoring, and management of data mining pipelines, lowering the barrier for non-technical users. Pipelines are composed of configurable nodes that encapsulate functionalities from popular libraries or custom user-defined code, fostering experimentation. The framework is evaluated through a case study and SWOT analysis with INGKA, a large-scale industry partner, alongside usability testing with real users and validation against scenarios from the literature. The paper then underscores the value of industry–academia collaboration in bridging theoretical innovation with practical application.

34 pages, 20058 KiB

Open Access, Editor’s Choice, Article

Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

by Grant Wardle and Teo Sušnjak

Big Data Cogn. Comput. 2025, 9(6), 149; https://doi.org/10.3390/bdcc9060149 - 3 Jun 2025

Viewed by 1100

Abstract

Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors (OpenAI, Google, and Anthropic), alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts.

32 pages, 4091 KiB

Open Access | Article

Improving Early Detection of Dementia: Extra Trees-Based Classification Model Using Inter-Relation-Based Features and K-Means Synthetic Minority Oversampling Technique

by Yanawut Chaiyo, Worasak Rueangsirarak, Georgi Hristov and Punnarumol Temdee

Big Data Cogn. Comput. 2025, 9(6), 148; https://doi.org/10.3390/bdcc9060148 - 30 May 2025

Viewed by 700

Abstract

The early detection of dementia, a condition affecting both individuals and society, is essential for its effective management. However, reliance on advanced laboratory tests and specialized expertise limits accessibility, hindering timely diagnosis. To address this challenge, this study proposes a novel approach in which readily available biochemical and physiological features from electronic health records are employed to develop a machine learning-based binary classification model, improving accessibility and early detection. A dataset of 14,763 records from Phachanukroh Hospital, Chiang Rai, Thailand, was used for model construction. A hybrid data enrichment framework combining feature augmentation and data balancing is proposed to increase the dimensionality of the data. Medical domain knowledge was used to generate inter-relation-based features (IRFs), which improve data diversity and promote explainability by making the features more informative. For data balancing, the K-Means Synthetic Minority Oversampling Technique (K-Means SMOTE) was applied to generate synthetic samples in under-represented regions of the feature space, addressing class imbalance. Extra Trees (ET) was used for model construction due to its noise resilience and ability to manage multicollinearity. The performance of the proposed method was compared with that of Support Vector Machine, K-Nearest Neighbors, Artificial Neural Networks, Random Forest, and Gradient Boosting. The results reveal that the ET model significantly outperformed the other models on the combined dataset with four IRFs and K-Means SMOTE across key metrics, including accuracy (96.47%), precision (94.79%), recall (97.86%), F1 score (96.30%), and area under the receiver operating characteristic curve (99.51%).
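
As a quick consistency check on the metrics reported above, the F1 score can be recomputed from the stated precision and recall. The sketch below is illustrative only: the function name `f1_score` is our own, and it assumes the reported F1 is the plain harmonic mean of the reported precision and recall, which the abstract does not state explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both precision and recall are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
recomputed = f1_score(94.79, 97.86)
print(f"F1 recomputed from precision and recall: {recomputed:.2f}%")
```

Under that assumption, the recomputed value agrees with the reported 96.30% to two decimal places.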

44 pages, 1434 KiB

Open Access | Editor’s Choice | Review

The Importance of AI Data Governance in Large Language Models

by Saurabh Pahune, Zahid Akhtar, Venkatesh Mandapati and Kamran Siddique

Big Data Cogn. Comput. 2025, 9(6), 147; https://doi.org/10.3390/bdcc9060147 - 28 May 2025

Cited by 1 | Viewed by 3686

Abstract

AI data governance is a crucial framework for managing how data are used throughout the large language model (LLM) lifecycle, from development through end-to-end testing, model validation, secure deployment, and operations. This requires the data to be managed responsibly, confidentially, securely, and ethically. The main objective of data governance is to implement a robust and intelligent framework for LLMs, one that affects data quality management, the fine-tuning of model performance, bias, compliance with data privacy laws, security protocols, ethical AI practices, and regulatory compliance processes. Effective data governance steps are important for minimizing data breaches, enhancing data security, ensuring regulatory compliance, mitigating bias, and establishing clear policies and guidelines. This paper covers the foundations of AI data governance, its key components, types of data governance, best practices, case studies, challenges, and future directions for data governance in LLMs. Additionally, we conduct a detailed analysis of data governance and of how effectively AI data governance must be integrated into LLMs for end users to trust them. Finally, we explore the relevance of data governance frameworks to the current LLM landscape in the healthcare, pharmaceutical, finance, supply chain management, and cybersecurity sectors, and we address the essential roles involved in applying these frameworks, along with their effectiveness and limitations.

15 pages, 1196 KiB

Open Access | Editor’s Choice | Article

Bone Segmentation in Low-Field Knee MRI Using a Three-Dimensional Convolutional Neural Network

by Ciro Listone, Diego Romano and Marco Lapegna

Big Data Cogn. Comput. 2025, 9(6), 146; https://doi.org/10.3390/bdcc9060146 - 28 May 2025

Viewed by 744

Abstract

Bone segmentation in magnetic resonance imaging (MRI) is crucial for clinical and research applications, including diagnosis, surgical planning, and treatment monitoring. However, it remains challenging due to anatomical variability and complex bone morphology. Manual segmentation is time-consuming and operator-dependent, fostering interest in automated methods. This study proposes an automated segmentation method based on a 3D U-Net convolutional neural network to segment the femur, tibia, and patella from low-field MRI scans. Low-field MRI offers advantages in cost, patient comfort, and accessibility but presents challenges related to lower signal quality. Our method achieved a Dice Similarity Coefficient (DSC) of 0.9838, Intersection over Union (IoU) of 0.9682, and Average Hausdorff Distance (AHD) of 0.0223, with an inference time of approximately 3.96 s per volume on a GPU. Although post-processing had minimal impact on metrics, it significantly enhanced the visual smoothness of bone surfaces, which is crucial for clinical use. The final segmentations enabled the creation of clean, 3D-printable bone models, beneficial for preoperative planning. These results demonstrate that the model achieves accurate segmentation with a high degree of overlap compared to manually segmented reference data. This accuracy results from meticulous fine-tuning of the network, along with the application of advanced data augmentation and post-processing techniques. Full article
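
For readers unfamiliar with the overlap metrics quoted above, the following minimal sketch shows how the Dice Similarity Coefficient and Intersection over Union are commonly computed for binary masks. It is not the authors' evaluation code: the function name `dice_and_iou`, the flattened-list mask representation, and the toy masks are assumptions made purely for illustration.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks given as flattened 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_size = sum(1 for p in pred if p)
    truth_size = sum(1 for t in truth if t)
    union = pred_size + truth_size - intersection
    if union == 0:
        # Both masks are empty; treated here as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + truth_size)
    iou = intersection / union
    return dice, iou


# Hypothetical toy masks (six voxels), not data from the study.
predicted = [1, 1, 1, 0, 0, 1]
reference = [1, 1, 0, 0, 0, 1]
dice, iou = dice_and_iou(predicted, reference)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). Plugging the reported IoU of 0.9682 into this identity gives roughly 0.9838, in line with the reported DSC, although averages taken over several volumes need not satisfy the identity exactly.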

18 pages, 597 KiB

Open Access | Editor’s Choice | Article

No-Code Edge Artificial Intelligence Frameworks Comparison Using a Multi-Sensor Predictive Maintenance Dataset

by Juan M. Montes-Sánchez, Plácido Fernández-Cuevas, Francisco Luna-Perejón, Saturnino Vicente-Diaz and Ángel Jiménez-Fernández

Big Data Cogn. Comput. 2025, 9(6), 145; https://doi.org/10.3390/bdcc9060145 - 26 May 2025

Viewed by 1094

Abstract

Edge Computing (EC) is one of the proposed solutions to the problems that industry faces when deploying Predictive Maintenance (PdM) systems, which can benefit from Edge Artificial Intelligence (Edge AI). In this work, we have compared six of the most popular no-code Edge AI frameworks on the market. The comparison considers economic cost, the number of features, usability, and performance. We used a combination of the analytic hierarchy process (AHP) and the technique for order performance by similarity to the ideal solution (TOPSIS) to compare the frameworks. We consulted ten independent experts on Edge AI, four employed in industry and the other six in academia. These experts defined the importance of each criterion by deciding the weights of TOPSIS using AHP. We performed two different classification tests on each framework platform using data from a public dataset for PdM on biomedical equipment. Magnetometer data were used for test 1, and accelerometer data were used for test 2. We obtained the F1 score, flash memory, and latency metrics. There was a high level of consensus between academia and industry when assigning the weights. Therefore, the overall comparison ranked the analyzed frameworks similarly. NanoEdgeAIStudio ranked first when considering all weights and industry-only weights, and Edge Impulse was the first option when using academia-only weights. In terms of performance, there is room for improvement in most frameworks, as they did not reach the metrics of the previously developed custom Edge AI solution. We identified some limitations that should be addressed to improve the comparison method in the future, such as adding weights to the feature criteria or increasing the number and variety of performance tests.
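
The ranking step described in this abstract can be illustrated with a minimal TOPSIS sketch. This is a generic textbook formulation, not the authors' implementation: the function name `topsis`, the three anonymous alternatives, the criteria values, and the weights are all hypothetical, and the weights are simply given here rather than derived with AHP as in the study.

```python
import math
from typing import List, Sequence


def topsis(matrix: Sequence[Sequence[float]],
           weights: Sequence[float],
           benefit: Sequence[bool]) -> List[float]:
    """Return the TOPSIS closeness score (0 to 1, higher is better) for each row.

    matrix  : one row per alternative, one column per criterion
    weights : importance of each criterion (assumed given, e.g. from AHP)
    benefit : True if larger values are better for that criterion, False if smaller
    """
    n_cols = len(weights)
    # Vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    # Ideal and anti-ideal points, taking the direction of each criterion into account.
    columns = list(zip(*weighted))
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        total = d_pos + d_neg
        # If all alternatives are identical, every distance is zero; return a neutral score.
        scores.append(d_neg / total if total > 0 else 0.5)
    return scores


# Hypothetical alternatives: [cost, F1 score, latency in ms]; not values from the study.
alternatives = [
    [100.0, 0.95, 12.0],
    [0.0 + 50.0, 0.90, 20.0],
    [75.0, 0.97, 15.0],
]
weights = [0.3, 0.5, 0.2]        # hypothetical weights
benefit = [False, True, False]   # lower cost and latency are better, higher F1 is better
for index, score in enumerate(topsis(alternatives, weights, benefit)):
    print(f"alternative {index}: closeness = {score:.3f}")
```

Alternatives are then ranked by descending closeness score. Changing the weight vector, for example to industry-only or academia-only weights, can change the ranking, which matches the kind of difference the abstract reports.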

(This article belongs to the Topic eHealth and mHealth: Challenges and Prospects, 2nd Edition)

33 pages, 958 KiB

Open Access | Article

The Impact of Blockchain Technology and Dynamic Capabilities on Banks’ Performance

by Abayomi Ogunrinde, Carmen De-Pablos-Heredero, José-Luis Montes-Botella and Luis Fernández-Sanz

Big Data Cogn. Comput. 2025, 9(6), 144; https://doi.org/10.3390/bdcc9060144 - 23 May 2025

Viewed by 2096

Abstract

Blockchain technology has sparked significant interest and is currently being researched by academics and practitioners due to its potential to reduce transaction costs, improve the security of transactions, increase transparency, etc. However, there is still much doubt about its impact, and the technology is still in its infancy, with varying degrees of adoption among different financial institutions. Structural Equation Modeling (SEM) was used to test the impact of blockchain and dynamic capabilities on the performance of top banks in Spain. This innovative approach seeks to understand how performance can be improved by deploying blockchain technology (BC) in banks. Results showed a significant association between banks’ adoption of blockchain and the generation of dynamic capabilities and financial performance. Thus, we can confirm that banks adopting blockchain are more likely to create dynamic capabilities than those that do not. Hence, blockchain technology is an important tool for achieving dynamic capabilities and increasing performance in banks. Based on the findings, we suggest areas for additional research and highlight policy considerations related to the wider adoption of blockchain technology.

34 pages, 3941 KiB

Open Access | Article

Ship Typhoon Avoidance Route Planning Method Under Uncertain Typhoon Forecasts

by Zhengwei He, Junhong Guo, Weihao Ma and Jinfeng Zhang

Big Data Cogn. Comput. 2025, 9(6), 143; https://doi.org/10.3390/bdcc9060143 - 23 May 2025

Viewed by 651

Abstract

Formulating effective typhoon avoidance routes is crucial for ensuring the safe navigation of ocean-going vessels. From a maritime safety perspective, this paper investigates ship route optimization under typhoon forecast uncertainty. Initially, the study calculates the probability of a ship encountering a typhoon based on the distribution of historical typhoon data within the radius of seven-level winds and the distance between the ship and the typhoon. Subsequently, the minimum safe distance is quantified, and a multi-objective ship route optimization model for typhoon avoidance is established. A three-dimensional multi-objective ant colony algorithm is designed to solve this model. Finally, a typhoon avoidance simulation experiment is conducted using Typhoon TAMRI and a classic route in the South China Sea as a case study. The experimental results demonstrate that under adverse conditions of uncertain typhoon forecasts, the proposed multi-objective typhoon avoidance route optimization model can effectively avoid high wind and wave areas of the typhoon while balancing and optimizing multiple navigation indicators. This model can serve as a reference for shipping companies in formulating typhoon avoidance strategies. Full article

(This article belongs to the Special Issue Application of Artificial Intelligence in Traffic Management)

20 pages, 932 KiB

Open Access | Article

Predicting the Damage of Urban Fires with Grammatical Evolution

by Constantina Kopitsa, Ioannis G. Tsoulos, Andreas Miltiadous and Vasileios Charilogis

Big Data Cogn. Comput. 2025, 9(6), 142; https://doi.org/10.3390/bdcc9060142 - 22 May 2025

Viewed by 799

Abstract

Fire, whether wild or urban, depends on the triad of oxygen, fuel, and heat. Urban fires, although smaller in scale, have devastating impacts, as evidenced by the 2018 wildfire in Mati, Attica (Greece), which claimed 104 lives. The elderly and children are the most vulnerable due to mobility and cognitive limitations. This study applies Grammatical Evolution (GE), a machine learning method that generates interpretable classification rules to predict the consequences of urban fires. Using historical data (casualties, containment time, and meteorological/demographic parameters), GE produces classification rules in human-readable form. The rules achieve over 85% accuracy, revealing critical correlations. For example, high temperatures (>35 °C) combined with irregular building layouts exponentially increase fatality risks, while firefighter response time proves more critical than fire intensity itself. Applications include dynamic evacuation strategies (real-time adaptation), preventive urban planning (fire-resistant materials and green buffer zones), and targeted awareness campaigns for at-risk groups. Unlike “black-box” machine learning techniques, GE offers transparent human-readable rules, enabling firefighters and authorities to make rapid informed decisions. Future advancements could integrate real-time data (IoT sensors and satellites) and extend the methodology to other natural disasters. Protecting urban centers from fires is not only a technological challenge but also a moral imperative to safeguard human lives and societal cohesion. Full article

Big Data Cogn. Comput., Volume 9, Issue 6 (June 2025) – 21 articles
