Journal Description
Big Data and Cognitive Computing
Big Data and Cognitive Computing is an international, peer-reviewed, open access journal on big data and cognitive computing, published monthly online by MDPI.
- Open Access: free for readers, with article processing charges (APCs) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), dblp, Inspec, Ei Compendex, and other databases.
- Journal Rank: JCR - Q1 (Computer Science, Theory and Methods) / CiteScore - Q1 (Computer Science Applications)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 25.3 days after submission; accepted papers are published within 5.6 days (median values for papers published in this journal in the second half of 2024).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 3.7 (2023)
Latest Articles
Image Visual Quality: Sharpness Evaluation in the Logarithmic Image Processing Framework
Big Data Cogn. Comput. 2025, 9(6), 154; https://doi.org/10.3390/bdcc9060154 - 9 Jun 2025
Abstract
In image processing, the acquisition step plays a fundamental role because it determines image quality. The present paper focuses on the issue of blur and suggests ways of assessing contrast. The aim of this work is to evaluate the sharpness of an image by means of objective measures based on mathematical, physical, and optical justifications connected with the human visual system. This is why the Logarithmic Image Processing (LIP) framework was chosen. The sharpness of an image is usually assessed near objects’ boundaries, which encourages the use of gradients, despite some major drawbacks. Within the LIP framework, it is possible to overcome such problems using a “contour detector” tool based on the notion of Logarithmic Additive Contrast (LAC). Considering a sequence of increasingly blurred images, we show that the use of LAC enables the images to be re-classified in accordance with their defocus level, demonstrating the relevance of the method. The proposed algorithm has been shown to outperform five conventional methods for assessing image sharpness. Moreover, it is the only method that is insensitive to brightness variations. Finally, various application examples are presented, such as automatic autofocus control or the comparison of two blur-removal algorithms applied to the same image, which particularly concerns the field of Super Resolution (SR) algorithms. Such algorithms multiply (×2, ×3, ×4) the resolution of an image using powerful tools (deep learning, neural networks) while correcting the potential defects (blur, noise) that could be generated by the resolution extension itself. We conclude with the prospects for this work, which should be part of a broader approach to estimating image quality, including sharpness and perceived contrast.
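The LIP operations underlying this approach are standard (the Jourlin–Pinoli model); the sketch below illustrates them with numpy, assuming the usual gray-tone bound M = 256. The sharpness proxy shown (largest adjacent-pixel LIP difference) is a simplified stand-in for the paper's LAC-based detector, not its actual definition.

```python
import numpy as np

M = 256.0  # upper bound of the gray-tone range in the LIP model

def lip_add(f, g):
    """LIP addition: f (+) g = f + g - f*g/M."""
    return f + g - f * g / M

def lip_sub(f, g):
    """LIP subtraction: f (-) g = M*(f - g)/(M - g), defined for g < M."""
    return M * (f - g) / (M - g)

def lip_sharpness(img):
    """Toy focus measure: the largest absolute LIP difference between
    horizontally adjacent pixels (a crude stand-in for the paper's LAC)."""
    img = img.astype(np.float64)
    d = lip_sub(img[:, 1:], img[:, :-1])
    return float(np.max(np.abs(d)))

# A sharp step edge yields a larger maximal LIP gradient than a blurred ramp:
sharp = np.tile([0, 0, 0, 200, 200, 200], (6, 1))
blurred = np.tile([0, 40, 80, 120, 160, 200], (6, 1))
print(lip_sharpness(sharp), lip_sharpness(blurred))
```

Re-ranking a stack of captures by this score is the essence of the defocus-classification experiment described above.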
Open Access Article
Real-Time Algal Monitoring Using Novel Machine Learning Approaches
by Seyit Uguz, Yavuz Selim Sahin, Pradeep Kumar, Xufei Yang and Gary Anderson
Big Data Cogn. Comput. 2025, 9(6), 153; https://doi.org/10.3390/bdcc9060153 - 9 Jun 2025
Abstract
Monitoring algal growth rates and estimating microalgae concentration in photobioreactor systems are critical for optimizing production efficiency. Traditional methods—such as microscopy, fluorescence, flow cytometry, spectroscopy, and macroscopic approaches—while accurate, are often costly, time-consuming, labor-intensive, and susceptible to contamination or production interference. To overcome these limitations, this study proposes an automated, real-time, and cost-effective solution by integrating machine learning with image-based analysis. We evaluated the performance of Decision Trees (DTS), Random Forests (RF), Gradient Boosting Machines (GBM), and K-Nearest Neighbors (k-NN) algorithms using RGB color histograms extracted from images of Scenedesmus dimorphus cultures. Ground truth data were obtained via manual cell enumeration under a microscope and dry biomass measurements. Among the models tested, DTS achieved the highest accuracy for cell count prediction (R2 = 0.77), while RF demonstrated superior performance for dry biomass estimation (R2 = 0.66). Compared to conventional methods, the proposed ML-based approach offers a low-cost, non-invasive, and scalable alternative that significantly reduces manual effort and response time. These findings highlight the potential of machine learning–driven imaging systems for continuous, real-time monitoring in industrial-scale microalgae cultivation.
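The feature pipeline described (per-channel RGB histograms feeding a tree-based regressor) can be sketched with scikit-learn; the synthetic image generator, bin count, and model settings below are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rgb_histogram_features(image, bins=16):
    """Concatenate per-channel intensity histograms into one feature vector."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256),
                          density=True)[0] for c in range(3)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)

def fake_culture_image(density):
    """Synthetic stand-in for a culture image: denser cultures look greener."""
    img = rng.integers(0, 80, size=(32, 32, 3))
    img[..., 1] = np.clip(img[..., 1] + int(150 * density), 0, 255)
    return img

densities = rng.uniform(0.1, 1.0, size=200)
X = np.stack([rgb_histogram_features(fake_culture_image(d)) for d in densities])
y = densities * 1e6  # pretend cells/mL from manual enumeration

model = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)
print(model.score(X, y))  # training R^2
```

In the study, the targets would instead come from microscope counts or dry-biomass measurements of Scenedesmus dimorphus cultures.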
Open Access Article
Real-Time Image Semantic Segmentation Based on Improved DeepLabv3+ Network
by Peibo Li, Jiangwu Zhou and Xiaohua Xu
Big Data Cogn. Comput. 2025, 9(6), 152; https://doi.org/10.3390/bdcc9060152 - 6 Jun 2025
Abstract
To improve the performance of image semantic segmentation algorithms and achieve a better balance between accuracy and real-time performance, this paper proposes a real-time image semantic segmentation model based on an improved DeepLabv3+ network. First, the MobileNetV2 model, with its low computational overhead and small number of parameters, is selected as the backbone network to improve segmentation speed. Then, a Feature Enhancement Module (FEM) is applied to several shallow features of different scales in MobileNetV2, and these shallow features are fused to improve the encoder's use of edge information, retain more detailed information, and strengthen the network's feature representation of complex scenes. Finally, to address the insufficient attention to detail in the merged output feature maps of the Atrous Spatial Pyramid Pooling (ASPP) module, the FEM attention mechanism is also applied to the feature maps processed by the ASPP module. The algorithm in this study achieves 76.45% mean intersection over union (mIoU) accuracy at 29.18 FPS on the PASCAL VOC2012 Augmented dataset, and 37.31% mIoU accuracy at 23.31 FPS on the ADE20K dataset. The experimental results show that the algorithm achieves a good balance between accuracy and real-time performance, and its image semantic segmentation performance is significantly improved compared to DeepLabv3+ and other existing algorithms.
Open Access Review
The Use of Large Language Models in Ophthalmology: A Scoping Review on Current Use-Cases and Considerations for Future Works in This Field
by See Ye King Clarence, Lim Khai Shin Alva, Au Wei Yung, Chia Si Yin Charlene, Fan Xiuyi and Li Zhenghao Kelvin
Big Data Cogn. Comput. 2025, 9(6), 151; https://doi.org/10.3390/bdcc9060151 - 6 Jun 2025
Abstract
The advancement of generative artificial intelligence (AI) has resulted in its use permeating many areas of life. Amidst this eruption of scientific output, a wide range of research regarding the usage of Large Language Models (LLMs) in ophthalmology has emerged. In this study, we aim to map out the landscape of LLM applications in ophthalmology and, by consolidating the work carried out, to produce a point of reference to guide the conduct of future works. Eight databases were searched for articles from 2019 to 2024. In total, 976 studies were screened, and a final 49 were included. The study designs and outcomes of these studies were analysed. The performance of LLMs was further analysed in the areas of exam taking and patient education, diagnostic capability, management capability, administration, inaccuracies, and harm. LLMs performed acceptably in most studies, even surpassing humans in some. Despite their relatively good performance, issues pertaining to study design, grading protocols, hallucinations, inaccuracies, and harm were found to be pervasive. LLMs have received considerable attention since their introduction to the public and have found potential applications in the field of medicine, and in particular, ophthalmology. However, this review recommends using standardised evaluation frameworks and addressing gaps in the current literature when applying LLMs in ophthalmology.
Open Access Article
A Framework for Rapidly Prototyping Data Mining Pipelines
by Flavio Corradini, Luca Mozzoni, Marco Piangerelli, Barbara Re and Lorenzo Rossi
Big Data Cogn. Comput. 2025, 9(6), 150; https://doi.org/10.3390/bdcc9060150 - 5 Jun 2025
Abstract
With the advent of Big Data, data mining techniques have become crucial for improving decision-making across diverse sectors, yet their employment demands significant resources and time. Time is critical in industrial contexts, as delays can lead to increased costs, missed opportunities, and reduced competitive advantage. To address this, systems for analyzing data can help prototype data mining pipelines, mitigating the risks of failure and resource wastage, especially when experimenting with novel techniques. Moreover, business experts often lack deep technical expertise and need robust support to validate their pipeline designs quickly. This paper presents Rainfall, a novel framework for rapidly prototyping data mining pipelines, developed through collaborative projects with industry. The framework’s requirements stem from a combination of literature review findings, iterative industry engagement, and analysis of existing tools. Rainfall enables the visual programming, execution, monitoring, and management of data mining pipelines, lowering the barrier for non-technical users. Pipelines are composed of configurable nodes that encapsulate functionalities from popular libraries or custom user-defined code, fostering experimentation. The framework is evaluated through a case study and SWOT analysis with INGKA, a large-scale industry partner, alongside usability testing with real users and validation against scenarios from the literature. The paper then underscores the value of industry–academia collaboration in bridging theoretical innovation with practical application.
Open Access Article
Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
by Grant Wardle and Teo Sušnjak
Big Data Cogn. Comput. 2025, 9(6), 149; https://doi.org/10.3390/bdcc9060149 - 3 Jun 2025
Abstract
Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts.
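The variable under study (modality order within a multi-modal prompt) can be made concrete with an OpenAI-style content-part message; the helper below is a hypothetical sketch showing how to swap image-first and text-first orderings, not code from the study, and vendor payload formats differ in detail.

```python
def build_prompt(question, image_url, image_first=True):
    """Assemble one multi-modal user message, controlling whether the image
    part precedes or follows the text part (OpenAI-style content parts)."""
    text_part = {"type": "text", "text": question}
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    parts = [image_part, text_part] if image_first else [text_part, image_part]
    return [{"role": "user", "content": parts}]

# For a single-image task, try both orders and keep whichever scores better
# on a validation set:
msg_img_first = build_prompt("What fault does the gauge show?",
                             "https://example.com/gauge.png")
msg_txt_first = build_prompt("What fault does the gauge show?",
                             "https://example.com/gauge.png", image_first=False)
print(msg_img_first[0]["content"][0]["type"])  # image_url
```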
Open Access Article
Improving Early Detection of Dementia: Extra Trees-Based Classification Model Using Inter-Relation-Based Features and K-Means Synthetic Minority Oversampling Technique
by Yanawut Chaiyo, Worasak Rueangsirarak, Georgi Hristov and Punnarumol Temdee
Big Data Cogn. Comput. 2025, 9(6), 148; https://doi.org/10.3390/bdcc9060148 - 30 May 2025
Abstract
The early detection of dementia, a condition affecting both individuals and society, is essential for its effective management. However, reliance on advanced laboratory tests and specialized expertise limits accessibility, hindering timely diagnosis. To address this challenge, this study proposes a novel approach in which readily available biochemical and physiological features from electronic health records are employed to develop a machine learning-based binary classification model, improving accessibility and early detection. A dataset of 14,763 records from Phachanukroh Hospital, Chiang Rai, Thailand, was used for model construction. The use of a hybrid data enrichment framework involving feature augmentation and data balancing was proposed in order to increase the dimensionality of the data. Medical domain knowledge was used to generate inter-relation-based features (IRFs), which improve data diversity and promote explainability by making the features more informative. For data balancing, the K-Means Synthetic Minority Oversampling Technique (K-Means SMOTE) was applied to generate synthetic samples in under-represented regions of the feature space, addressing class imbalance. Extra Trees (ET) was used for model construction due to its noise resilience and ability to manage multicollinearity. The performance of the proposed method was compared with that of Support Vector Machine, K-Nearest Neighbors, Artificial Neural Networks, Random Forest, and Gradient Boosting. The results reveal that the ET model significantly outperformed other models on the combined dataset with four IRFs and K-Means SMOTE across key metrics, including accuracy (96.47%), precision (94.79%), recall (97.86%), F1 score (96.30%), and area under the receiver operating characteristic curve (99.51%).
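The balancing-plus-classification step can be approximated with plain scikit-learn: cluster the minority class with K-Means and interpolate between same-cluster points, which is the core idea of K-Means SMOTE (a full implementation such as imbalanced-learn's KMeansSMOTE would normally be used; the toy data and settings below are illustrative, not the paper's).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(42)

def kmeans_smote_sketch(X_min, n_new, n_clusters=3):
    """Generate synthetic minority samples by interpolating between points
    that fall in the same K-Means cluster (the core idea of K-Means SMOTE)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_min)
    synthetic = []
    while len(synthetic) < n_new:
        c = rng.integers(n_clusters)
        members = X_min[labels == c]
        if len(members) < 2:
            continue  # cannot interpolate inside a singleton cluster
        a, b = members[rng.choice(len(members), 2, replace=False)]
        synthetic.append(a + rng.uniform() * (b - a))
    return np.array(synthetic)

# Imbalanced toy data: 300 majority vs. 40 minority samples in 5 features.
X_maj = rng.normal(0.0, 1.0, size=(300, 5))
X_min = rng.normal(2.0, 1.0, size=(40, 5))
X_syn = kmeans_smote_sketch(X_min, n_new=260)

X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * 300 + [1] * (40 + 260))
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Clustering first concentrates the synthetic samples in genuinely minority regions of the feature space, which is what the abstract credits for addressing class imbalance.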
Open Access Review
The Importance of AI Data Governance in Large Language Models
by Saurabh Pahune, Zahid Akhtar, Venkatesh Mandapati and Kamran Siddique
Big Data Cogn. Comput. 2025, 9(6), 147; https://doi.org/10.3390/bdcc9060147 - 28 May 2025
Abstract
AI data governance is a crucial framework for ensuring that data are properly utilized throughout the lifecycle of large language model (LLM) activity, from development to end-to-end testing, model validation, secure deployment, and operations. This requires data to be managed responsibly, confidentially, securely, and ethically. The main objective of data governance is to implement a robust and intelligent governance framework for LLMs, one that shapes data quality management, model fine-tuning and performance, bias, data privacy law compliance, security protocols, ethical AI practices, and regulatory compliance processes. Effective data governance is important for minimizing data breaches, enhancing data security, ensuring regulatory compliance, mitigating bias, and establishing clear policies and guidelines. This paper covers the foundations of AI data governance, its key components, types of data governance, best practices, case studies, challenges, and future directions for data governance in LLMs. Additionally, we conduct a comprehensive analysis of how effectively AI data governance must be integrated for LLMs to earn end-user trust. Finally, we provide deeper insights into the relevance of data governance frameworks to the current landscape of LLMs in the healthcare, pharmaceutical, finance, supply chain management, and cybersecurity sectors, and discuss their essential roles, effectiveness, and limitations.
Open Access Article
Bone Segmentation in Low-Field Knee MRI Using a Three-Dimensional Convolutional Neural Network
by Ciro Listone, Diego Romano and Marco Lapegna
Big Data Cogn. Comput. 2025, 9(6), 146; https://doi.org/10.3390/bdcc9060146 - 28 May 2025
Abstract
Bone segmentation in magnetic resonance imaging (MRI) is crucial for clinical and research applications, including diagnosis, surgical planning, and treatment monitoring. However, it remains challenging due to anatomical variability and complex bone morphology. Manual segmentation is time-consuming and operator-dependent, fostering interest in automated methods. This study proposes an automated segmentation method based on a 3D U-Net convolutional neural network to segment the femur, tibia, and patella from low-field MRI scans. Low-field MRI offers advantages in cost, patient comfort, and accessibility but presents challenges related to lower signal quality. Our method achieved a Dice Similarity Coefficient (DSC) of 0.9838, Intersection over Union (IoU) of 0.9682, and Average Hausdorff Distance (AHD) of 0.0223, with an inference time of approximately 3.96 s per volume on a GPU. Although post-processing had minimal impact on metrics, it significantly enhanced the visual smoothness of bone surfaces, which is crucial for clinical use. The final segmentations enabled the creation of clean, 3D-printable bone models, beneficial for preoperative planning. These results demonstrate that the model achieves accurate segmentation with a high degree of overlap compared to manually segmented reference data. This accuracy results from meticulous fine-tuning of the network, along with the application of advanced data augmentation and post-processing techniques.
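The two overlap metrics reported (DSC and IoU) are straightforward to compute on binary masks; below is a minimal numpy sketch using toy 3D masks rather than MRI data.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-8):
    """IoU = |A ∩ B| / |A ∪ B| for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Toy 3D volumes: a 2x2x2 cube vs. a partially overlapping 2x2x3 slab.
a = np.zeros((4, 4, 4), dtype=np.uint8); a[1:3, 1:3, 1:3] = 1
b = np.zeros_like(a); b[1:3, 1:3, :3] = 1
print(round(dice_coefficient(a, b), 3), round(iou(a, b), 3))  # 0.8 0.667
```

Note the identity DSC = 2·IoU/(1 + IoU), which the toy values (0.8 and 2/3) satisfy; the Average Hausdorff Distance also reported in the abstract measures boundary error instead of overlap.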
Open Access Article
No-Code Edge Artificial Intelligence Frameworks Comparison Using a Multi-Sensor Predictive Maintenance Dataset
by Juan M. Montes-Sánchez, Plácido Fernández-Cuevas, Francisco Luna-Perejón, Saturnino Vicente-Diaz and Ángel Jiménez-Fernández
Big Data Cogn. Comput. 2025, 9(6), 145; https://doi.org/10.3390/bdcc9060145 - 26 May 2025
Abstract
Edge Computing (EC) is one of the proposed solutions to the problems that industry faces when implementing Predictive Maintenance (PdM) systems, which can benefit from Edge Artificial Intelligence (Edge AI). In this work, we have compared six of the most popular no-code Edge AI frameworks in the market. The comparison considers economic cost, the number of features, usability, and performance. We used a combination of the Analytic Hierarchy Process (AHP) and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to compare the frameworks. We consulted ten independent experts on Edge AI, four employed in industry and the other six in academia. These experts defined the importance of each criterion by deciding the weights of TOPSIS using AHP. We performed two different classification tests on each framework platform using data from a public dataset for PdM on biomedical equipment. Magnetometer data were used for test 1, and accelerometer data were used for test 2. We obtained the F1 score, flash memory, and latency metrics. There was a high level of consensus between the worlds of academia and industry when assigning the weights. Therefore, the overall comparison ranked the analyzed frameworks similarly. NanoEdgeAIStudio ranked first when considering all weights and industry-only weights, and Edge Impulse was the first option when using academia-only weights. In terms of performance, there is room for improvement in most frameworks, as they did not reach the metrics of the previously developed custom Edge AI solution. We identified some limitations that should be fixed to improve the comparison method in the future, like adding weights to the feature criteria or increasing the number and variety of performance tests.
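The ranking step can be sketched directly: TOPSIS scores each alternative by its relative closeness to an ideal solution under criterion weights. The frameworks, criterion values, and weights below are hypothetical stand-ins for the AHP-elicited ones, not the paper's data.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS: vector-normalize the decision matrix,
    apply weights, find ideal and anti-ideal solutions per criterion, and
    score each row by relative closeness (higher = better)."""
    norm = matrix / np.linalg.norm(matrix, axis=0)
    v = norm * weights
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    return d_neg / (d_pos + d_neg)

# Hypothetical decision matrix: rows = frameworks, columns =
# [F1 score (benefit), flash kB (cost), latency ms (cost), yearly cost $ (cost)]
frameworks = ["A", "B", "C"]
matrix = np.array([[0.91, 120.0, 4.0,   0.0],
                   [0.88,  80.0, 2.5, 500.0],
                   [0.75,  60.0, 9.0,   0.0]])
weights = np.array([0.4, 0.2, 0.2, 0.2])  # e.g. elicited via AHP comparisons
benefit = np.array([True, False, False, False])

scores = topsis(matrix, weights, benefit)
ranking = [frameworks[i] for i in np.argsort(scores)[::-1]]
print(ranking)
```

In the study, the weights come from AHP pairwise comparisons aggregated over the ten consulted experts, which is why industry-only and academia-only weightings can produce different winners.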
(This article belongs to the Topic eHealth and mHealth: Challenges and Prospects, 2nd Edition)
Open Access Article
The Impact of Blockchain Technology and Dynamic Capabilities on Banks’ Performance
by Abayomi Ogunrinde, Carmen De-Pablos-Heredero, José-Luis Montes-Botella and Luis Fernández-Sanz
Big Data Cogn. Comput. 2025, 9(6), 144; https://doi.org/10.3390/bdcc9060144 - 23 May 2025
Abstract
Blockchain technology has sparked significant interest among academics and practitioners due to its potential to reduce transaction costs, improve the security of transactions, and increase transparency. However, there is still much doubt about its impact, and the technology is still in its infancy, with varying degrees of adoption among financial institutions. Structural Equation Modeling (SEM) analysis was utilized to test the impact of blockchain and dynamic capabilities on the performance of top banks in Spain. This innovative approach seeks to understand how performance can be improved by deploying blockchain technology (BC) in banks. Results showed a significant association between banks’ adoption of blockchain, the generation of dynamic capabilities, and financial performance. Thus, we can confirm that a bank adopting blockchain is more likely to create dynamic capabilities than one that does not. Hence, blockchain technology is an important tool for achieving dynamic capabilities and increasing performance in banks. Based on the findings, we suggest areas for additional research and highlight policy considerations related to the wider adoption of blockchain technology.
Open Access Article
Ship Typhoon Avoidance Route Planning Method Under Uncertain Typhoon Forecasts
by Zhengwei He, Junhong Guo, Weihao Ma and Jinfeng Zhang
Big Data Cogn. Comput. 2025, 9(6), 143; https://doi.org/10.3390/bdcc9060143 - 23 May 2025
Abstract
Formulating effective typhoon avoidance routes is crucial for ensuring the safe navigation of ocean-going vessels. From a maritime safety perspective, this paper investigates ship route optimization under typhoon forecast uncertainty. Initially, the study calculates the probability of a ship encountering a typhoon based on the distribution of historical typhoon data within the level-7 wind radius and the distance between the ship and the typhoon. Subsequently, the minimum safe distance is quantified, and a multi-objective ship route optimization model for typhoon avoidance is established. A three-dimensional multi-objective ant colony algorithm is designed to solve this model. Finally, a typhoon avoidance simulation experiment is conducted using Typhoon TAMRI and a classic route in the South China Sea as a case study. The experimental results demonstrate that under adverse conditions of uncertain typhoon forecasts, the proposed multi-objective typhoon avoidance route optimization model can effectively avoid high wind and wave areas of the typhoon while balancing and optimizing multiple navigation indicators. This model can serve as a reference for shipping companies in formulating typhoon avoidance strategies.
(This article belongs to the Special Issue Application of Artificial Intelligence in Traffic Management)
Open Access Article
Predicting the Damage of Urban Fires with Grammatical Evolution
by Constantina Kopitsa, Ioannis G. Tsoulos, Andreas Miltiadous and Vasileios Charilogis
Big Data Cogn. Comput. 2025, 9(6), 142; https://doi.org/10.3390/bdcc9060142 - 22 May 2025
Abstract
Fire, whether wild or urban, depends on the triad of oxygen, fuel, and heat. Urban fires, although smaller in scale, have devastating impacts, as evidenced by the 2018 wildfire in Mati, Attica (Greece), which claimed 104 lives. The elderly and children are the most vulnerable due to mobility and cognitive limitations. This study applies Grammatical Evolution (GE), a machine learning method that generates interpretable classification rules to predict the consequences of urban fires. Using historical data (casualties, containment time, and meteorological/demographic parameters), GE produces classification rules in human-readable form. The rules achieve over 85% accuracy, revealing critical correlations. For example, high temperatures (>35 °C) combined with irregular building layouts exponentially increase fatality risks, while firefighter response time proves more critical than fire intensity itself. Applications include dynamic evacuation strategies (real-time adaptation), preventive urban planning (fire-resistant materials and green buffer zones), and targeted awareness campaigns for at-risk groups. Unlike “black-box” machine learning techniques, GE offers transparent human-readable rules, enabling firefighters and authorities to make rapid informed decisions. Future advancements could integrate real-time data (IoT sensors and satellites) and extend the methodology to other natural disasters. Protecting urban centers from fires is not only a technological challenge but also a moral imperative to safeguard human lives and societal cohesion.
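The following is a hypothetical example of the human-readable rule form that GE evolves, built from the risk factors mentioned in the abstract; the thresholds and logic are illustrative, not rules from the paper.

```python
def fatality_risk_rule(temp_c, layout_irregular, response_min):
    """An illustrative classification rule in the transparent, human-readable
    form that Grammatical Evolution produces: high ambient temperature
    combined with an irregular building layout, or a slow firefighter
    response, flags a high-risk incident."""
    if (temp_c > 35.0 and layout_irregular) or response_min > 15.0:
        return "high"
    return "low"

print(fatality_risk_rule(38.0, True, 8.0))   # high
print(fatality_risk_rule(30.0, False, 6.0))  # low
```

Because such rules are plain conditionals rather than opaque model weights, firefighters and authorities can audit exactly why an incident was flagged.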
Open Access Article
A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
by Ermelinda Oro, Francesco Maria Granata and Massimo Ruffolo
Big Data Cogn. Comput. 2025, 9(5), 141; https://doi.org/10.3390/bdcc9050141 - 21 May 2025
Abstract
This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE, English SciFact, ArguAna, and NFCorpus. We assess four LLMs (GPT4o, LLama-3.1 8B, Mistral-Nemo, and Gemma-2b) for QA tasks within a retrieval-augmented generation (RAG) pipeline. We evaluate them on SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. The results show multilingual models perform more competitively than language-specific ones. The embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, though performance consistency decreases in specialized domains, (2) model size does not consistently predict performance, and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with innovative LLM-based assessment techniques. It establishes new benchmarks for multilingual language technologies while providing actionable insights for real-world IR and QA system deployment.
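The retrieval metric reported (nDCG@10) can be computed in a few lines; the sketch below uses the linear-gain DCG variant (some definitions use 2^rel − 1 instead), and the relevance grades are made up for illustration.

```python
import numpy as np

def dcg_at_k(rels, k=10):
    """DCG@k with the standard log2 position discount over relevance grades."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k=10):
    """nDCG@k = DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the top retrieved passages, in retrieved order:
retrieved = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(retrieved, k=10), 3))
```

scikit-learn's `ndcg_score` offers an equivalent off-the-shelf implementation for batch evaluation across queries.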
Full article
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
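The nDCG@10 scores reported in the abstract above follow the standard normalized discounted cumulative gain formula. A minimal sketch in Python, using hypothetical relevance grades rather than any of the paper's data:

```python
import math

def dcg(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top results returned by a retriever:
ranking = [3, 2, 3, 0, 1, 2]
print(round(ndcg(ranking), 4))  # → 0.9608
```

nDCG@10 divides the DCG of the model's ranking by the DCG of the ideal ranking, so a perfect ordering scores exactly 1.0.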
Open Access Article
Polarity of Yelp Reviews: A BERT–LSTM Comparative Study
by
Rachid Belaroussi, Sié Cyriac Noufe, Francis Dupin and Pierre-Olivier Vandanjon
Big Data Cogn. Comput. 2025, 9(5), 140; https://doi.org/10.3390/bdcc9050140 - 21 May 2025
Abstract
With the rapid growth in social network comments, the need for more effective methods to classify their polarity—negative, neutral, or positive—has become essential. Sentiment analysis, powered by natural language processing, has evolved significantly with the adoption of advanced deep learning techniques. Long Short-Term Memory networks capture long-range dependencies in text, while transformers, with their attention mechanisms, excel at preserving contextual meaning and handling high-dimensional, semantically complex data. This study compares the performance of sentiment analysis models based on LSTM and BERT architectures using key evaluation metrics. The dataset consists of business reviews from the Yelp Open Dataset. We tested LSTM-based methods against BERT and its variants—RoBERTa, BERTweet, and DistilBERT—leveraging popular pipelines from the Hugging Face Hub. A class-by-class performance analysis is presented, revealing that more complex BERT-based models do not always guarantee superior results in the classification of Yelp reviews. Additionally, the use of bidirectionality in LSTMs does not necessarily lead to better performance. However, across a diversity of test sets, transformer models outperform traditional RNN-based models, as their generalization capability is greater than that of a simple LSTM model.
Full article
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
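The class-by-class analysis mentioned above rests on per-class precision, recall, and F1 over the three polarity labels. A minimal sketch with toy gold labels and predictions (not the paper's Yelp data):

```python
LABELS = ("negative", "neutral", "positive")

def per_class_f1(y_true, y_pred):
    # Precision, recall, and F1 computed separately for each polarity class.
    scores = {}
    for label in LABELS:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores

# Toy gold labels and model predictions:
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "negative", "positive", "positive", "neutral"]
print(per_class_f1(gold, pred)["positive"])
```

Reporting the three classes separately is what reveals the pattern the study describes: an overall accuracy gap between BERT variants and LSTMs can hide very different behavior on the minority (often neutral) class.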
Open Access Article
Tri-Collab: A Machine Learning Project to Leverage Innovation Ecosystems in Portugal
by
Ângelo Marujo, Bruno Afonso, Inês Martins, Lisandro Pires and Sílvia Fernandes
Big Data Cogn. Comput. 2025, 9(5), 139; https://doi.org/10.3390/bdcc9050139 - 20 May 2025
Abstract
This project consists of a digital platform named Tri-Collab, where investors, entrepreneurs, and other agents (mainly talents) can cooperate on their ideas and eventually co-create. It is a digital means for this triad of actors (among other potential ones) to better adjust their requirements. It includes an app that communicates with a database of projects and of innovation agents and their profiles; the originality lies in the matching algorithm. Co-creation can thus be better supported through this assertive interconnection of players and their resources. This work also highlights the usefulness of the Business Model Canvas in structuring the idea and its dashboard, allowing a comprehensive view of channels, challenges, and gains. The potential of machine learning to improve matchmaking platforms is also discussed, especially as technological advances make it possible to generate forecasts and match people at scale.
Full article
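The abstract does not specify how Tri-Collab's matching algorithm works; one common approach such a matchmaking service might take is ranking candidate profiles by cosine similarity of feature vectors. A hypothetical sketch (all names, features, and vectors invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two profile vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matches(seeker, candidates, top_k=2):
    # Rank candidate profiles by similarity to the seeker's requirements.
    ranked = sorted(candidates.items(), key=lambda kv: -cosine(seeker, kv[1]))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical feature vectors (sector fit, stage fit, ticket size, skills overlap):
entrepreneur = [1.0, 0.8, 0.2, 0.9]
investors = {
    "inv_a": [1.0, 0.9, 0.1, 0.8],
    "inv_b": [0.1, 0.2, 1.0, 0.1],
    "inv_c": [0.9, 0.7, 0.3, 1.0],
}
print(best_matches(entrepreneur, investors, top_k=1))
```

In a production platform the vectors would come from the database of projects and agent profiles, and a learned model could replace the hand-crafted similarity.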

Open Access Article
A Comparative Study of Ensemble Machine Learning and Explainable AI for Predicting Harmful Algal Blooms
by
Omer Mermer, Eddie Zhang and Ibrahim Demir
Big Data Cogn. Comput. 2025, 9(5), 138; https://doi.org/10.3390/bdcc9050138 - 20 May 2025
Abstract
Harmful algal blooms (HABs), driven by environmental pollution, pose significant threats to water quality, public health, and aquatic ecosystems. This study enhances the prediction of HABs in Lake Erie, part of the Great Lakes system, by utilizing ensemble machine learning (ML) models coupled with explainable artificial intelligence (XAI) for interpretability. Using water quality data from 2013 to 2020, various physical, chemical, and biological parameters were analyzed to predict chlorophyll-a (Chl-a) concentrations, a commonly used indicator of phytoplankton biomass and a proxy for algal blooms. This study employed multiple ensemble ML models, including random forest (RF), deep forest (DF), gradient boosting (GB), and XGBoost, and compared their performance against individual models, such as support vector machine (SVM), decision tree (DT), and multi-layer perceptron (MLP). The findings revealed that the ensemble models, particularly XGBoost and deep forest (DF), achieved superior predictive accuracy, with R2 values of 0.8517 and 0.8544, respectively. The application of SHapley Additive exPlanations (SHAP) provided insights into the relative importance of the input features, identifying particulate organic nitrogen (PON), particulate organic carbon (POC), and total phosphorus (TP) as the critical factors influencing Chl-a concentrations. This research demonstrates the effectiveness of ensemble ML models for achieving high predictive accuracy, while the integration of XAI enhances model interpretability. The results support the development of proactive water quality management strategies and highlight the potential of advanced ML techniques for environmental monitoring.
Full article
(This article belongs to the Special Issue Machine Learning Applications and Big Data Challenges)
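SHAP attributions like those used in the study above are additive: the per-feature values sum to the difference between the model's prediction and the baseline. Computing them for tree ensembles requires the `shap` package, but for a linear model with independent features the exact SHAP value has a closed form, which makes the additivity property easy to illustrate. A toy sketch (invented weights and observation, not the study's model):

```python
def linear_shap(weights, x, baseline):
    # For a linear model f(x) = sum(w_i * x_i) with independent features,
    # the exact SHAP value of feature i is w_i * (x_i - baseline_i).
    return [w * (xi - bi) for w, xi, bi in zip(weights, x, baseline)]

# Hypothetical standardized drivers of chlorophyll-a in a toy linear surrogate:
features = ["PON", "POC", "TP"]
weights = [0.8, 0.5, 0.3]
x = [1.2, -0.4, 2.0]          # one observation
baseline = [0.0, 0.0, 0.0]    # dataset mean in standardized units

phi = linear_shap(weights, x, baseline)
ranked = sorted(zip(features, map(abs, phi)), key=lambda t: -t[1])
print(ranked[0][0])  # feature with the largest attribution for this sample
```

Ranking features by mean absolute SHAP value across many samples is how importance plots like the study's PON/POC/TP ordering are typically produced.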
Open Access Article
The Development of Small-Scale Language Models for Low-Resource Languages, with a Focus on Kazakh and Direct Preference Optimization
by
Nurgali Kadyrbek, Zhanseit Tuimebayev, Madina Mansurova and Vítor Viegas
Big Data Cogn. Comput. 2025, 9(5), 137; https://doi.org/10.3390/bdcc9050137 - 20 May 2025
Abstract
Low-resource languages remain underserved by contemporary large language models (LLMs) because they lack sizable corpora, bespoke preprocessing tools, and the computing budgets assumed by mainstream alignment pipelines. Focusing on Kazakh, we present a 1.94B parameter LLaMA-based model that demonstrates how strong, culturally aligned performance can be achieved without massive infrastructure. The contribution is threefold. (i) Data and tokenization—we compile a rigorously cleaned, mixed-domain Kazakh corpus and design a tokenizer that respects the language’s agglutinative morphology, mixed-script usage, and diacritics. (ii) Training recipe—the model is built in two stages: causal language modeling from scratch followed by instruction tuning. Alignment is further refined with Direct Preference Optimization (DPO), extended by contrastive and entropy-based regularization to stabilize training under sparse, noisy preference signals. Two complementary resources support this step: ChatTune-DPO, a crowd-sourced set of human preference pairs, and Pseudo-DPO, an automatically generated alternative that repurposes instruction data to reduce annotation cost. (iii) Evaluation and impact—qualitative and task-specific assessments show that targeted monolingual training and the proposed DPO variant markedly improve factuality, coherence, and cultural fidelity over baseline instruction-only and multilingual counterparts. The model and datasets are released under open licenses, offering a reproducible blueprint for extending state-of-the-art language modeling to other under-represented languages and domains.
Full article
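The Direct Preference Optimization step described above minimizes a logistic loss over preference pairs, scored relative to a frozen reference model. A minimal sketch of the per-pair loss (hypothetical sequence log-probabilities; the paper's contrastive and entropy regularizers are omitted):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO pushes the policy to prefer the chosen response y_w over the
    # rejected y_l, measured relative to a frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

# Hypothetical sequence log-probabilities for one preference pair:
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0,
                ref_logp_w=-13.0, ref_logp_l=-14.0, beta=0.1)
print(round(loss, 4))  # → 0.5981
```

The loss shrinks as the policy widens the chosen-over-rejected margin beyond the reference model's, which is exactly the signal that resources such as ChatTune-DPO and Pseudo-DPO supply as preference pairs.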

Open Access Article
Helium Speech Recognition Method Based on Spectrogram with Deep Learning
by
Yonghong Chen, Shibing Zhang and Dongmei Li
Big Data Cogn. Comput. 2025, 9(5), 136; https://doi.org/10.3390/bdcc9050136 - 20 May 2025
Abstract
With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is indispensable for saturation diving operations and is a critical technology for deep saturation diving, serving as the sole communication method that ensures the smooth execution of such operations. This study introduces deep learning into helium speech recognition and proposes a spectrogram-based dual-model helium speech recognition method. First, we extract spectrogram features from the helium speech. Then, we combine a deep fully convolutional neural network with connectionist temporal classification (CTC) to form an acoustic model, in which the spectrogram features of helium speech are used as input to convert speech signals into phonetic sequences. Finally, a maximum entropy hidden Markov model (MEMM) is employed as the language model to convert the phonetic sequences into word outputs; this decoding is treated as a dynamic programming problem, and the Viterbi algorithm is used to find the optimal path from phonetic sequences to word sequences. The simulation results show that the method can effectively recognize helium speech, with a recognition rate of 97.89% for isolated words and 95.99% for continuous helium speech.
Full article
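The decoding step described above searches for the most probable state sequence with the Viterbi algorithm. A generic sketch of the dynamic-programming recursion on a toy two-state model (a textbook HMM example, not the paper's MEMM or its phonetic data):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] is the probability of the most likely state sequence
    # ending in state s after observing obs[:t+1]; back[t][s] remembers
    # the predecessor that achieved it.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace the optimal path backwards from the best final state.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ("rainy", "sunny")
start_p = {"rainy": 0.6, "sunny": 0.4}
trans_p = {"rainy": {"rainy": 0.7, "sunny": 0.3},
           "sunny": {"rainy": 0.4, "sunny": 0.6}}
emit_p = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
```

In the paper's setting the states would be words, the observations phonetic sequences, and the transition/emission scores would come from the MEMM rather than fixed tables.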

Open Access Article
Applying Big Data for Maritime Accident Risk Assessment: Insights, Predictive Insights and Challenges
by
Vicky Zampeta, Gregory Chondrokoukis and Dimosthenis Kyriazis
Big Data Cogn. Comput. 2025, 9(5), 135; https://doi.org/10.3390/bdcc9050135 - 19 May 2025
Abstract
Maritime safety is a critical concern for the transport sector and remains a key challenge for the international shipping industry. Recognizing that maritime accidents pose significant risks to both safety and operational efficiency, this study explores the application of big data analysis techniques to understand the factors influencing maritime transport accidents (MTA). Specifically, using extensive datasets derived from vessel performance measurements, environmental conditions, and accident reports, it seeks to identify the key intrinsic and extrinsic factors contributing to maritime accidents. The research examines more than 90,000 incidents for the period 2014–2022. Leveraging big data analytics and advanced statistical techniques, the findings reveal significant correlations between vessel size, speed, and specific environmental factors. Furthermore, the study highlights the potential of big data analytics in enhancing predictive modeling, real-time risk assessment, and decision-making processes for maritime traffic management. The integration of big data with intelligent transportation systems (ITSs) can optimize safety strategies, improve accident prevention mechanisms, and enhance the resilience of ocean-going transportation systems. By bridging the gap between big data applications and maritime safety research, this work contributes to the literature by emphasizing the importance of examining both intrinsic and extrinsic factors in predicting maritime accident risks. Additionally, it underscores the transformative role of big data in shaping safer and more efficient waterway transportation systems.
Full article
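Correlation findings like those above typically start from a pairwise Pearson coefficient computed over incident records. A minimal sketch with invented per-incident values (the study's actual variables and dataset are not reproduced here):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-incident records: vessel speed (knots) vs. a damage
# severity score, to illustrate the computation only.
speed = [10, 14, 9, 18, 22, 12]
severity = [2, 3, 2, 5, 6, 3]
print(round(pearson(speed, severity), 3))
```

At the scale of 90,000+ incidents the same computation would be run per factor pair (vessel size, speed, environmental conditions) to surface the significant correlations the study reports.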

Topics
Topic in Applied Sciences, BDCC, Future Internet, Information, Sci
Social Computing and Social Network Analysis
Topic Editors: Carson K. Leung, Fei Hao, Giancarlo Fortino, Xiaokang Zhou
Deadline: 30 June 2025
Topic in AI, BDCC, Fire, GeoHazards, Remote Sensing
AI for Natural Disasters Detection, Prediction and Modeling
Topic Editors: Moulay A. Akhloufi, Mozhdeh Shahbazi
Deadline: 25 July 2025
Topic in Algorithms, BDCC, BioMedInformatics, Information, Mathematics
Machine Learning Empowered Drug Screen
Topic Editors: Teng Zhou, Jiaqi Wang, Youyi Song
Deadline: 31 August 2025
Topic in IJERPH, JPM, Healthcare, BDCC, Applied Sciences, Sensors
eHealth and mHealth: Challenges and Prospects, 2nd Edition
Topic Editors: Antonis Billis, Manuel Dominguez-Morales, Anton Civit
Deadline: 31 October 2025

Special Issues
Special Issue in BDCC
Transforming Cyber Security Provision through Utilizing Artificial Intelligence
Guest Editors: Peter R. J. Trim, Yang-Im Lee
Deadline: 25 June 2025
Special Issue in BDCC
Business Intelligence and Big Data in E-commerce
Guest Editors: George Stalidis, Dimitrios Kardaras
Deadline: 30 June 2025
Special Issue in BDCC
Blockchain and Cloud Computing in Big Data and Generative AI Era
Guest Editors: Syed Muslim Jameel, Carson K. Leung
Deadline: 30 June 2025
Special Issue in BDCC
Big Data Analytics with Machine Learning for Cyber Security
Guest Editors: Babu Baniya, Sherif Abdelfattah, Deepak GC
Deadline: 30 June 2025