Big Data Cogn. Comput., Volume 9, Issue 2 (February 2025) – 31 articles

Cover Story: Android malware detection using AI is essential for preventing cyberattacks. This study applies genetic programming symbolic classifiers (GPSCs) to extract symbolic expressions (SEs) that classify malware. To optimize GPSC hyperparameters, a random hyperparameter value search (RHVS) and 5-fold cross-validation (5FCV) were used. The highly imbalanced dataset was balanced using preprocessing and oversampling techniques. Three approaches were tested: all input variables, high-importance features, and PCA. The best SEs formed threshold-based voting ensembles (TBVEs), achieving a peak accuracy of 0.98.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open it.
23 pages, 1421 KiB  
Article
EmoBERTa-X: Advanced Emotion Classifier with Multi-Head Attention and DES for Multilabel Emotion Classification
by Farah Hassan Labib, Mazen Elagamy and Sherine Nagy Saleh
Big Data Cogn. Comput. 2025, 9(2), 48; https://doi.org/10.3390/bdcc9020048 - 19 Feb 2025
Cited by 1 | Viewed by 943
Abstract
The rising prevalence of social media has turned these platforms into huge, rich repositories of human emotion. Understanding and categorizing human emotion from social media content is of fundamental importance for many reasons, such as improving user experience, monitoring public sentiment, supporting mental health, and sharpening targeted marketing strategies. However, social media text is often unstructured and ambiguous; hence, extracting meaningful emotional information is difficult, and effective emotion classification needs advanced techniques. This article proposes a novel model, EmoBERTa-X, to enhance performance in multilabel emotion classification, particularly in informal and ambiguous social media texts. Attention mechanisms combined with ensemble learning, supported by preprocessing steps, help mitigate issues such as dataset class imbalance, ambiguity in short texts, and the inherent complexities of multilabel classification. The experimental results on the GoEmotions dataset indicate that EmoBERTa-X outperforms state-of-the-art models on fine-grained emotion-detection tasks in social media expressions, with an accuracy increase of 4.32% over popular approaches.
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
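A minimal sketch of the multilabel mechanics the abstract describes: multi-head attention pooling over encoder token embeddings and independent sigmoid outputs per emotion (BCE loss and per-label thresholding rather than argmax). The hidden size, head count, 28-label GoEmotions taxonomy, and learned pooling query are illustrative assumptions, not EmoBERTa-X's actual architecture.

```python
import torch
import torch.nn as nn

class MultiLabelEmotionHead(nn.Module):
    """Attention-pooled multilabel head: one sigmoid per label, not softmax."""
    def __init__(self, hidden=768, heads=8, num_labels=28):  # 28 = GoEmotions labels
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, hidden))  # learned pooling query
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_embeddings):                      # (batch, seq, hidden)
        q = self.query.expand(token_embeddings.size(0), -1, -1)
        pooled, _ = self.attn(q, token_embeddings, token_embeddings)
        return self.classifier(pooled.squeeze(1))             # raw logits per emotion

head = MultiLabelEmotionHead()
logits = head(torch.randn(2, 16, 768))       # stand-in encoder outputs
probs = torch.sigmoid(logits)                # independent per-label probabilities
predicted = probs > 0.5                      # multilabel decision, not argmax
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))  # multilabel loss
```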
21 pages, 3633 KiB  
Article
Reusing ML Models in Dynamic Data Environments: Data Similarity-Based Approach for Efficient MLOps
by Eduardo Peixoto, Diogo Torres, Davide Carneiro, Bruno Silva and Ruben Marques
Big Data Cogn. Comput. 2025, 9(2), 47; https://doi.org/10.3390/bdcc9020047 - 19 Feb 2025
Viewed by 596
Abstract
The rapid integration of Machine Learning (ML) into organizational practices has driven demand for substantial computational resources, incurring both high economic costs and environmental impact, particularly from energy consumption. This challenge is amplified in dynamic data environments, where ML models must be frequently retrained to adapt to evolving data patterns. To address this, more sustainable Machine Learning Operations (MLOps) pipelines are needed to reduce environmental impact while maintaining model accuracy. In this paper, we propose a model reuse approach based on data similarity metrics, which allows organizations to leverage previously trained models where applicable. We introduce a tailored set of meta-features to characterize data windows, enabling efficient similarity assessment between historical and new data. The effectiveness of the proposed method is validated across multiple ML tasks using the cosine and Bray–Curtis distance functions, evaluating both model reuse rates and the performance of reused models relative to newly trained alternatives. The results indicate that the proposed approach can reduce the frequency of model retraining by 70–90% while maintaining or even improving predictive performance, contributing to more resource-efficient and sustainable MLOps practices.
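A sketch of the similarity check at the heart of the reuse decision, using SciPy's cosine and Bray–Curtis distances over meta-feature vectors. The meta-feature values and the reuse threshold below are invented for illustration; the paper's tailored meta-features and their tuning are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cosine, braycurtis

# Hypothetical meta-features summarizing two data windows
# (e.g., per-feature means, variances, skewness) -- values are illustrative.
historical_window = np.array([0.51, 1.20, 0.03, 7.8, 0.92])
incoming_window   = np.array([0.49, 1.25, 0.05, 7.6, 0.90])

d_cos = cosine(historical_window, incoming_window)      # 1 - cosine similarity
d_bc  = braycurtis(historical_window, incoming_window)  # Bray-Curtis dissimilarity

REUSE_THRESHOLD = 0.05  # assumed tolerance; would be tuned per task
if max(d_cos, d_bc) < REUSE_THRESHOLD:
    print("Windows similar enough: reuse the previously trained model")
else:
    print("Data drifted: retrain")
```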
18 pages, 307 KiB  
Article
Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri
by Maxim Bakaev, Svetlana Gorovaia and Olga Mitrofanova
Big Data Cogn. Comput. 2025, 9(2), 46; https://doi.org/10.3390/bdcc9020046 - 18 Feb 2025
Viewed by 731
Abstract
Previously, it was suggested that the “persona-driven” approach can contribute to producing sufficiently diverse synthetic training data for Large Language Models (LLMs), which are currently about to run out of real natural language texts. In our paper, we explore whether personas evoked from LLMs through HCI-style descriptions can indeed imitate human-like differences in authorship. To this end, we ran an associative experiment with 50 human participants and four artificial personas evoked from the popular LLM-based services GPT-4(o) and YandexGPT Pro. For each of the five stimulus words selected from university websites’ homepages, we asked both groups of subjects to come up with 10 short associations (in Russian). We then used cosine similarity and Mahalanobis distance to measure the distance between the association lists produced by different humans and personas. While the differences in similarity were significant across human associators and across gender and age groups, this was not the case for the different personas evoked from ChatGPT or YandexGPT. Our findings suggest that the LLM-based services so far fall short of imitating the associative thesauri of different authors. The outcome of our study might be of interest to computational linguists, as well as AI/ML scientists and prompt engineers.
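For readers unfamiliar with the two distance measures, a toy sketch comparing vectorized association lists with cosine and Mahalanobis distance (the latter requires an inverse covariance matrix, here estimated from the pooled human responses). The random vectors stand in for whatever embedding of the association lists the study actually used.

```python
import numpy as np
from scipy.spatial.distance import cosine, mahalanobis

rng = np.random.default_rng(0)

# Stand-ins for embedded association lists (e.g., averaged word vectors);
# in the study these would come from the human and persona responses.
human_assoc   = rng.normal(size=(50, 8))   # 50 human associators
persona_assoc = rng.normal(size=(4, 8))    # 4 LLM personas

# Inverse covariance estimated from the pooled human responses
VI = np.linalg.inv(np.cov(human_assoc, rowvar=False))

d_cos = cosine(human_assoc[0], persona_assoc[0])
d_mah = mahalanobis(human_assoc[0], persona_assoc[0], VI)
print(f"cosine distance: {d_cos:.3f}, Mahalanobis distance: {d_mah:.3f}")
```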
29 pages, 11018 KiB  
Article
Impact on Classification Process Generated by Corrupted Features
by Simona Moldovanu, Dan Munteanu and Carmen Sîrbu
Big Data Cogn. Comput. 2025, 9(2), 45; https://doi.org/10.3390/bdcc9020045 - 18 Feb 2025
Viewed by 602
Abstract
This study tests the robustness of machine learning (ML) and neural network (NN) models using a new approach based on corrupted data. Typically, ML and NN classifiers are trained on real feature data; however, a portion of the features may be false, noisy, or incorrect. The undesired content was analyzed in eight experiments with false data, six with feature noise, and six with label noise, all conducted on the public Breast Cancer Wisconsin Dataset (BCWD). Throughout, the data were gradually and randomly corrupted, generating new values that replaced raw BCWD features. Artificial Intelligence (AI) methods should be selected carefully when categorizing diseases from medical data. The Pearson correlation coefficient (PCC) was applied between features to monitor their correlation in each experiment, using a correlation matrix spanning both true and false features. Four ML algorithms—Random Forest (RF), XGBClassifier (XGB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM)—were used, both for the analysis of important features (IF) and for binary classification. The study was completed using three deep neural networks—a simple Deep Neural Network (DNN), a Convolutional Neural Network (CNN), and a Transformer Neural Network (TNN). For binary classification of malignant versus benign breast cancer (BC), the accuracy, F1-score, Area Under the Curve (AUC), and Matthews correlation coefficient (MCC) were computed. The results demonstrate the robustness of some methods and the sensitivity of others in the context of corrupted data, computational cost, and hyperparameter optimization.
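A hedged sketch of the corruption-and-monitoring idea: randomly replace a growing fraction of feature values with noise and track the Pearson correlation between the original and corrupted columns. The corruption function and its parameters are illustrative, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(569, 5))                 # stand-in for BCWD feature columns

def corrupt(X, fraction, noise_scale=1.0):
    """Randomly replace a fraction of feature values with Gaussian noise."""
    Xc = X.copy()
    mask = rng.random(X.shape) < fraction
    Xc[mask] = rng.normal(scale=noise_scale, size=mask.sum())
    return Xc

for fraction in (0.0, 0.2, 0.5):
    Xc = corrupt(X, fraction)
    # Pearson correlation between each original and corrupted column
    pcc = [np.corrcoef(X[:, j], Xc[:, j])[0, 1] for j in range(X.shape[1])]
    print(f"corruption {fraction:.0%}: mean PCC vs. original = {np.mean(pcc):.3f}")
```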
32 pages, 498 KiB  
Review
A Survey on the Applications of Cloud Computing in the Industrial Internet of Things
by Elias Dritsas and Maria Trigka
Big Data Cogn. Comput. 2025, 9(2), 44; https://doi.org/10.3390/bdcc9020044 - 17 Feb 2025
Cited by 1 | Viewed by 2537
Abstract
The convergence of cloud computing and the Industrial Internet of Things (IIoT) has significantly transformed industrial operations, enabling intelligent, scalable, and efficient systems. This survey provides a comprehensive analysis of the role cloud computing plays in IIoT ecosystems, focusing on its architectural frameworks, service models, and application domains. By leveraging centralized, edge, and hybrid cloud architectures, IIoT systems achieve enhanced real-time processing capabilities, streamlined data management, and optimized resource allocation. Moreover, this study delves into integrating artificial intelligence (AI) and machine learning (ML) in cloud platforms to facilitate predictive analytics, anomaly detection, and operational intelligence in IIoT environments. Security challenges, including secure device-to-cloud communication and privacy concerns, are addressed with innovative solutions like blockchain and AI-powered intrusion detection systems. Future trends, such as adopting 5G, serverless computing, and AI-driven adaptive services, are also discussed, offering a forward-looking perspective on this rapidly evolving domain. Finally, this survey contributes to a well-rounded understanding of cloud computing’s multifaceted aspects and highlights its pivotal role in driving the next generation of industrial innovation and operational excellence.
(This article belongs to the Special Issue Application of Cloud Computing in Industrial Internet of Things)
12 pages, 2665 KiB  
Article
Association Between Mastication Pattern, Periodontal Condition, and Cognitive Condition—Investigation Using Large Database of Japanese Universal Healthcare System
by Takahiko Shiba, Daisuke Sasaki, Juanna Xie, Chia-Yu Chen, Hiroyuki Tanaka and Shigemi Nagai
Big Data Cogn. Comput. 2025, 9(2), 43; https://doi.org/10.3390/bdcc9020043 - 17 Feb 2025
Viewed by 614
Abstract
The decline in oral health commonly occurs as a natural consequence of aging or due to various pathological factors. Tooth loss, which diminishes masticatory ability, has been associated with negative impacts on cognitive function. This observational study analyzed dental and medical records from Japan’s Universal Healthcare System (UHCS) national database to investigate the relationship between cognitive and oral disorders, focusing on periodontitis and decreased tooth-to-tooth contact between the maxillary and mandibular arches. A descriptive data analysis evaluated diagnostic codes for Alzheimer’s disease and cognitive impairment alongside dental treatment records from 2013 to 2018. The odds ratios for cognitive impairment in patients with partial loss of natural tooth contact were 1.6663 (p < 0.05) for early elderly individuals (aged 65–75) and 1.5003 (p < 0.0001) for advanced elderly individuals (over 75). Periodontally compromised patients had higher odds, with ratios of 1.3936 (p < 0.0001) for early elderly individuals and 1.1888 (p < 0.00001) for advanced elderly individuals, compared to their periodontally healthy counterparts. These findings suggest a potential link between cognitive health, natural tooth contact preservation, and periodontitis, with the loss of natural tooth contacts having the most significant impact on cognitive function.
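As a reminder of how such odds ratios are computed, a minimal example from a hypothetical 2×2 contingency table; the counts below are invented for illustration, not the UHCS data.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = lost tooth contact yes/no,
# columns = cognitive impairment yes/no. Counts are illustrative only.
table = [[120, 1800],   # partial loss of natural tooth contact
         [ 80, 2000]]   # tooth contacts preserved

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.4f}, p = {p_value:.4g}")
```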
24 pages, 63326 KiB  
Article
Exploration of Generative Neural Networks for Police Facial Sketches
by Nerea Sádaba-Campo and Hilario Gómez-Moreno
Big Data Cogn. Comput. 2025, 9(2), 42; https://doi.org/10.3390/bdcc9020042 - 14 Feb 2025
Viewed by 1487
Abstract
This article addresses the impact of generative artificial intelligence on the creation of composite sketches for police investigations. The automation of this task, traditionally performed through artistic methods or image composition, has become a challenge that can be tackled with generative neural networks. In this context, technologies such as Generative Adversarial Networks, Variational Autoencoders, and Diffusion Models are analyzed. The study also focuses on the use of advanced tools like DALL-E, Midjourney, and primarily Stable Diffusion, which enable the generation of highly detailed and realistic facial images from textual descriptions or sketches and allow for rapid and precise morphofacial modifications. Additionally, the study explores the capacity of these tools to interpret user-provided facial feature descriptions and adjust the generated results accordingly. The article concludes that these technologies have the potential to automate the composite sketch creation process. Therefore, their integration could not only expedite this process but also enhance its accuracy and utility in the identification of suspects or missing persons, representing a groundbreaking advancement in the field of criminal investigation.
19 pages, 867 KiB  
Article
Exploring the Boundaries Between LLM Code Clone Detection and Code Similarity Assessment on Human and AI-Generated Code
by Zixian Zhang and Takfarinas Saber
Big Data Cogn. Comput. 2025, 9(2), 41; https://doi.org/10.3390/bdcc9020041 - 13 Feb 2025
Cited by 1 | Viewed by 1894
Abstract
As Large Language Models (LLMs) continue to advance, their capabilities in code clone detection have garnered significant attention. While much research has assessed LLM performance on human-generated code, the proliferation of LLM-generated code raises critical questions about their ability to detect clones across both human- and LLM-created codebases, as this capability remains largely unexplored. This paper addresses this gap by evaluating two versions of LLaMA3 on these distinct types of datasets. Additionally, we perform a deeper analysis beyond simple prompting, examining the nuanced relationship between code cloning and code similarity that LLMs infer. We further explore how fine-tuning impacts LLM performance in clone detection, offering new insights into the interplay between code clones and similarity in human versus AI-generated code. Our findings reveal that LLaMA models excel in detecting syntactic clones but face challenges with semantic clones. Notably, the models perform better on LLM-generated datasets for semantic clones, suggesting a potential bias. The fine-tuning technique enhances the ability of LLMs to comprehend code semantics, improving their performance in both code clone detection and code similarity assessment. Our results offer valuable insights into the effectiveness and characteristics of LLMs in clone detection and code similarity assessment, providing a foundation for future applications and guiding further research in this area.
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
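As a contrast to LLM-based judgment, a deliberately crude token-overlap baseline illustrates why syntactic clones are easy and semantic clones hard; this is a generic illustration, not a method from the paper.

```python
import re

def token_jaccard(code_a: str, code_b: str) -> float:
    """Crude token-set Jaccard similarity: catches syntactic clones,
    but (unlike an LLM) misses semantically equivalent rewrites."""
    tokens = lambda s: set(re.findall(r"[A-Za-z_]\w*|\S", s))
    a, b = tokens(code_a), tokens(code_b)
    return len(a & b) / len(a | b) if a | b else 1.0

original        = "def add(x, y):\n    return x + y"
syntactic_clone = "def add(a, b):\n    return a + b"
semantic_clone  = "def total(nums):\n    s = 0\n    for n in nums:\n        s += n\n    return s"

print(token_jaccard(original, syntactic_clone))  # high overlap
print(token_jaccard(original, semantic_clone))   # low overlap despite related intent
```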
23 pages, 1994 KiB  
Article
Artificial Intelligence in Digital Marketing: Towards an Analytical Framework for Revealing and Mitigating Bias
by Catherine Reed, Martin Wynn and Robin Bown
Big Data Cogn. Comput. 2025, 9(2), 40; https://doi.org/10.3390/bdcc9020040 - 12 Feb 2025
Viewed by 3100
Abstract
Artificial intelligence (AI) affects many aspects of modern life, and most predictions are that the impact of AI on business and society will only increase. In the marketing function of today’s leading businesses, two main types of AI can be discerned. Traditional AI centres on supervised learning algorithms to support and enable the application of data rules, predictive functionality and other AI-based features. Generative AI, on the other hand, uses large language model (LLM) data sets and user prompts to generate new content. While AI-generated applications and content can boost efficiency, they also present challenges regarding transparency and authenticity, and the question of bias is central to these concerns. This article adopts a qualitative inductive approach to research this issue in the context of the marketing function of a global software supplier. Based on a systematic literature review and in-depth interviews with company marketeers, the perceived bias issues in coding, prompting and deployment of AI in digital marketing are identified. Then, based on a provisional conceptual framework derived from the extant literature, an analytical framework for revealing and mitigating bias in digital marketing is put forward, incorporating the perspectives of industry-based practitioners. The framework can be used as a checklist of marketing activities in which bias may exist in either traditional or generative AI across different stages of the customer journey. The article thus contributes to the development of theory and practice regarding the management of bias in AI-generated content and will be of interest to researchers and practitioners as an operational guide and point of departure for subsequent studies.
24 pages, 16681 KiB  
Article
A Deep Ensemble Learning Approach Based on a Vision Transformer and Neural Network for Multi-Label Image Classification
by Anas W. Abulfaraj and Faisal Binzagr
Big Data Cogn. Comput. 2025, 9(2), 39; https://doi.org/10.3390/bdcc9020039 - 11 Feb 2025
Viewed by 1316
Abstract
Convolutional Neural Networks (CNNs) have proven to be very effective in image classification due to their status as a powerful feature learning algorithm. Traditional approaches have considered the problem of multiclass classification, where the goal is to classify a set of objects at once. However, co-occurrence can make the discriminative features of the target less salient and may lead to overfitting of the model, resulting in lower performance. To address this, we propose a multi-label classification ensemble model including a Vision Transformer (ViT) and CNN for directly detecting one or multiple objects in an image. First, we improve the MobileNetV2 and DenseNet201 models using extra convolutional layers to strengthen image classification. In detail, three convolution layers are applied in parallel at the end of both models. ViT can learn dependencies among distant positions and local detail, making it an effective tool for multi-label classification. Finally, an ensemble learning algorithm is used to combine the classification predictions of the ViT, the modified MobileNetV2, and the modified DenseNet201 for increased image classification accuracy using a voting system. The performance of the proposed model is examined on four benchmark datasets, achieving accuracies of 98.24%, 98.89%, 99.91%, and 96.69% on PASCAL VOC 2007, PASCAL VOC 2012, MS-COCO, and NUS-WIDE 318, respectively, showing that our framework can enhance current state-of-the-art methods.
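A minimal sketch of the voting step: average each model's per-label sigmoid probabilities, then threshold (soft voting). The logits below are invented stand-ins for the ViT, modified MobileNetV2, and modified DenseNet201 outputs; the paper's exact voting rule may differ, e.g., hard majority voting over thresholded predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical per-model logits for one image over 5 labels
# (stand-ins for ViT, modified MobileNetV2, and modified DenseNet201).
logits_vit       = np.array([ 2.1, -1.3,  0.4, -2.0, 1.7])
logits_mobilenet = np.array([ 1.8, -0.9,  0.9, -1.5, 0.2])
logits_densenet  = np.array([ 2.4, -1.1, -0.2, -1.8, 1.1])

probs = np.mean([sigmoid(l) for l in
                 (logits_vit, logits_mobilenet, logits_densenet)], axis=0)
predicted_labels = probs > 0.5   # soft vote: average probabilities, then threshold
print(probs.round(3), predicted_labels)
```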
25 pages, 13698 KiB  
Article
Self-Supervised Foundation Model for Template Matching
by Anton Hristov, Dimo Dimov and Maria Nisheva-Pavlova
Big Data Cogn. Comput. 2025, 9(2), 38; https://doi.org/10.3390/bdcc9020038 - 11 Feb 2025
Viewed by 1059
Abstract
Finding a template location in a query image is a fundamental problem in many computer vision applications, such as localization of known objects, image registration, image matching, and object tracking. Currently available methods fail when insufficient training data are available or when big variations in texture, different modalities, and weak visual features exist in the images, limiting their application to real-world tasks. We introduce the Self-Supervised Foundation Model for Template Matching (Self-TM), a novel end-to-end approach to self-supervised learning for template matching. The idea behind Self-TM is to learn hierarchical features incorporating localization properties from images without any annotations. Going deeper into the convolutional neural network (CNN) layers, the filters begin to react to more complex structures and their receptive fields increase, which leads to a loss of localization information relative to the early layers. Hierarchical propagation from the last layers back to the first layer results in precise template localization. Due to its zero-shot generalization capabilities on tasks such as image retrieval, dense template matching, and sparse image matching, our pre-trained model can be classified as a foundation model.
(This article belongs to the Special Issue Perception and Detection of Intelligent Vision)
21 pages, 1753 KiB  
Article
Explainable Deep Learning for COVID-19 Vaccine Sentiment in Arabic Tweets Using Multi-Self-Attention BiLSTM with XLNet
by Asmaa Hashem Sweidan, Nashwa El-Bendary, Shereen A. Taie, Amira M. Idrees and Esraa Elhariri
Big Data Cogn. Comput. 2025, 9(2), 37; https://doi.org/10.3390/bdcc9020037 - 10 Feb 2025
Viewed by 923
Abstract
The COVID-19 pandemic has generated a vast corpus of online conversations regarding vaccines, predominantly on social media platforms like X (formerly known as Twitter). However, analyzing sentiment in Arabic text is challenging due to the diverse dialects and the lack of readily available sentiment analysis resources for the Arabic language. This paper proposes an explainable Deep Learning (DL) approach designed for sentiment analysis of Arabic tweets related to COVID-19 vaccinations. The proposed approach utilizes a Bidirectional Long Short-Term Memory (BiLSTM) network with a Multi-Self-Attention (MSA) mechanism to capture contextual impacts over long spans within the tweets, while the BiLSTM constructively learns the sequential nature of Arabic text. Moreover, XLNet embeddings are utilized to feed contextual information into the model. Subsequently, two essential Explainable Artificial Intelligence (XAI) methods, namely Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), are employed to gain further insight into the features’ contributions to overall model performance and thereby achieve a reasonable interpretation of the model’s output. The experimental results indicate that the combined XLNet–BiLSTM model outperforms other implemented state-of-the-art methods, achieving an accuracy of 93.2% and an F-measure of 92% for average sentiment classification. The integration of LIME and SHAP not only enhanced the model’s interpretability but also provided detailed insights into the factors that influence the classification of emotions. These findings underscore the model’s effectiveness and reliability for sentiment analysis in low-resource languages such as Arabic.
34 pages, 8053 KiB  
Article
Novel Extreme-Lightweight Fully Convolutional Network for Low Computational Cost in Microbiological and Cell Analysis: Detection, Quantification, and Segmentation
by Juan A. Ramirez-Quintana, Edgar A. Salazar-Gonzalez, Mario I. Chacon-Murguia and Carlos Arzate-Quintana
Big Data Cogn. Comput. 2025, 9(2), 36; https://doi.org/10.3390/bdcc9020036 - 9 Feb 2025
Viewed by 746
Abstract
Integrating deep learning into microbiological and cell analysis from microscopic image samples has gained significant attention in recent years, driven by the rise of novel medical technologies and pressing global health challenges. Numerous methods for segmentation and classification in microscopic images have emerged in the literature. However, key challenges persist due to the limited development of specialized deep learning models to accurately detect and quantify microorganisms and cells from microscopic samples. In response to this gap, this paper introduces MBnet, an Extreme-Lightweight Neural Network for Microbiological and Cell Analysis. MBnet is a binary segmentation method based on a Fully Convolutional Network designed to detect and quantify microorganisms and cells, featuring a low computational cost architecture with only 575 parameters. Its innovative design includes a foreground module and an encoder–decoder structure composed of traditional, depthwise, and separable convolution layers. These layers integrate color, orientation, and morphological features to generate an understanding of different contexts in microscopic sample images for binary segmentation. Experiments were conducted using datasets containing bacteria, yeast, and blood cells. The results suggest that MBnet outperforms other popular networks in the literature in counting, detecting, and segmenting cells and unicellular microorganisms. These findings underscore the potential of MBnet as a highly efficient solution for real-world applications in health monitoring and bioinformatics.
16 pages, 2152 KiB  
Article
Enhancing the FFT-LSTM Time-Series Forecasting Model via a Novel FFT-Based Feature Extraction–Extension Scheme
by Kyrylo Yemets, Ivan Izonin and Ivanna Dronyuk
Big Data Cogn. Comput. 2025, 9(2), 35; https://doi.org/10.3390/bdcc9020035 - 8 Feb 2025
Viewed by 1823
Abstract
Enhancing the accuracy of time-series forecasting with artificial intelligence tools is increasingly critical in light of the rapid advancement of modern technologies, particularly deep learning and neural networks. These approaches have already shown considerable advantages over traditional methods, especially due to their capacity to efficiently process large datasets and detect complex patterns. A crucial step in the forecasting process is the preprocessing of time-series data, which can greatly improve the training quality of neural networks and the precision of their predictions. This paper introduces a novel preprocessing technique that integrates information from both the time and frequency domains. To achieve this, the authors developed a feature extraction–extension scheme, where the extraction component obtains the phase and amplitude of complex numbers through the fast Fourier transform (FFT), and the extension component expands the time intervals by enriching them with the corresponding frequency characteristics of each individual time point. Building upon this preprocessing method, the FFT-LSTM forecasting model, which combines the strengths of the FFT and Long Short-Term Memory (LSTM) recurrent neural networks, was enhanced. The improved FFT-LSTM model was simulated on two time series with distinct characteristics. The results revealed a substantial improvement in forecasting accuracy compared to established methods in this domain, with about a 5% improvement in MAE and RMSE, validating the effectiveness of the proposed approach for forecasting applications across various fields.
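A small NumPy sketch of the extraction step (FFT amplitude and phase) and one plausible reading of the extension step, pairing each time point with dominant-frequency characteristics; the exact pairing used in the paper may differ.

```python
import numpy as np

t = np.arange(64)
series = np.sin(2 * np.pi * t / 16) + 0.1 * np.random.default_rng(0).normal(size=64)

# Extraction: amplitude and phase of the complex FFT coefficients
spectrum  = np.fft.rfft(series)
amplitude = np.abs(spectrum)
phase     = np.angle(spectrum)

# Extension (one plausible reading of the scheme): enrich each time step
# with the amplitude/phase of its dominant frequency component.
k = np.argmax(amplitude[1:]) + 1          # dominant non-DC frequency bin
extended = np.column_stack([
    series,                                # original time-domain value
    np.full_like(series, amplitude[k]),    # dominant amplitude
    np.full_like(series, phase[k]),        # dominant phase
])
print(extended.shape)   # (64, 3): time-domain feature enriched with frequency info
```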
13 pages, 3489 KiB  
Article
Does Low Spoilage Under Cold Conditions Foster Cultural Complexity During the Foraging Era?—Agent-Based Modeling and Reinforcement-Learning Approach
by Minhyeok Lee
Big Data Cogn. Comput. 2025, 9(2), 34; https://doi.org/10.3390/bdcc9020034 - 8 Feb 2025
Viewed by 697
Abstract
Human cultural complexity did not arise in a vacuum. This study employs agent-based modeling (ABM) and ecological modeling perspectives, combined with reinforcement-learning techniques, to investigate whether conditions that allowed for lower spoilage of stored food, often associated with colder climates and abundant large fauna, might have indirectly fostered the emergence of cultural complexity. Specifically, we developed a mathematical framework to capture how spoilage rates, yield levels, resource management skills, and cultural activities interact within a multi-agent system. Under restrictive constraints, we proved that lower spoilage and adequate yields reduced the frequency of hunting, freeing time for cultural pursuits. We then implemented a reinforcement-learning simulation to validate these predictions by training agents in different (Y, p) environments, where Y is the yield and p is the probability of daily spoilage. Our regression analysis and visualizations showed strong correlations between stable conditions with lower spoilage and higher levels of cultural investment. While we do not claim to replicate prehistoric social realities directly, our findings highlight the potential of ABM and ecological modeling to illuminate how environmental factors influence the allocation of time to complex cultural activities. This work offers a computationally grounded perspective that bridges humanistic inquiries into the origins of culture with formal agent-based methods.
(This article belongs to the Special Issue Recent Advances in Big Data-Driven Prescriptive Analytics)
26 pages, 906 KiB  
Article
Large Language Models as Kuwaiti Annotators
by Hana Alostad
Big Data Cogn. Comput. 2025, 9(2), 33; https://doi.org/10.3390/bdcc9020033 - 8 Feb 2025
Viewed by 805
Abstract
Stance detection for low-resource languages, such as the Kuwaiti dialect, poses a significant challenge in natural language processing (NLP) due to the scarcity of annotated datasets and specialized tools. This study addresses these limitations by evaluating the effectiveness of open large language models (LLMs) in automating stance detection through zero-shot and few-shot prompt engineering, with a focus on the potential of open-source models to achieve performance levels comparable to those of closed-source alternatives. We also highlight the critical distinctions between zero- and few-shot learning, emphasizing their significance for addressing the challenges posed by low-resource languages. Our evaluation involved testing 11 LLMs on a manually labeled dataset of social media posts, including GPT-4o, Gemini Pro 1.5, Mistral-Large, Jais-30B, and AYA-23. As expected, closed-source models such as GPT-4o, Gemini Pro 1.5, and Mistral-Large demonstrated superior performance, achieving maximum F1 scores of 95.4%, 95.0%, and 93.2%, respectively, in few-shot scenarios with English as the prompt template language. However, open-source models such as Jais-30B and AYA-23 achieved competitive results, with maximum F1 scores of 93.0% and 93.1%, respectively, under the same conditions. Furthermore, statistical analysis using ANOVA and Tukey’s HSD post hoc tests revealed no significant differences in overall performance among GPT-4o, Gemini Pro 1.5, Mistral-Large, Jais-30B, and AYA-23. This finding underscores the potential of open-source LLMs as cost-effective and privacy-preserving alternatives for low-resource language annotation. This is the first study comparing LLMs for stance detection in the Kuwaiti dialect. Our findings highlight the importance of prompt design and model consistency in improving the quality of annotations and pave the way for NLP solutions for under-represented Arabic dialects.
(This article belongs to the Special Issue Generative AI and Large Language Models)
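A sketch of how a few-shot stance prompt might be assembled; the labels, wording, and examples are illustrative, not the study's templates, and zero-shot corresponds to passing an empty example list.

```python
def build_stance_prompt(post: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot stance-detection prompt (illustrative wording)."""
    lines = ["Classify the stance of the following Kuwaiti-dialect post "
             "as FAVOR, AGAINST, or NONE.\n"]
    for text, label in examples:           # few-shot examples; omit for zero-shot
        lines.append(f"Post: {text}\nStance: {label}\n")
    lines.append(f"Post: {post}\nStance:")
    return "\n".join(lines)

few_shot = [("...example post 1...", "FAVOR"),
            ("...example post 2...", "AGAINST")]
prompt = build_stance_prompt("...new post to annotate...", few_shot)
print(prompt)  # send to an LLM API of choice (GPT-4o, Jais-30B, AYA-23, ...)
```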
30 pages, 9373 KiB  
Article
Dependency Reduction Techniques for Performance Improvement of Hyperledger Fabric Blockchain
by Ju-Won Kim, Jae-Geun Song, In-Hwan Park, Dong-Hwan Jo, Yong-Jin Kim and Ju-Wook Jang
Big Data Cogn. Comput. 2025, 9(2), 32; https://doi.org/10.3390/bdcc9020032 - 7 Feb 2025
Viewed by 909
Abstract
We propose dependency reduction techniques for the performance enhancement of the Hyperledger Fabric blockchain. A dependency hazard may result from the parallelism in Hyperledger Fabric, which executes multiple transactions simultaneously in a single block. Since multiple transactions in a block are executed in parallel for throughput enhancement, dependency problems may arise among transactions involving the same key (if Z = A + D is executed in parallel with A = B + C, a read-after-write hazard on A will occur). To address these issues, our scheme proposes a transaction dependency checking system that integrates a dependency-tree-based management approach to dynamically prioritize transactions based on factors such as tree level, arrival time, and starvation possibility. Our scheme constructs a dependency tree for the transactions in a block to be executed in parallel over multiple execution units, and rearranges the transactions into blocks in such a way that dependencies among the transactions are removed as far as possible. This allows transactions to be executed in parallel without collision, enhancing throughput over the conventional implementation of Hyperledger Fabric. An illustrative implementation of the proposed scheme in a testbed for trading renewable energy shows a performance improvement as large as 27%, depending on the input mixture of transactions. A key innovation is the Starve-Avoid method, which mitigates data starvation by dynamically adjusting transaction priorities to balance throughput and fairness, ensuring that no transaction experiences indefinite delays. Unlike existing approaches that require structural modifications to Hyperledger Fabric, the proposed scheme optimizes performance as an independent module, maintaining compatibility with the conventional Hyperledger Fabric architecture.
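A toy sketch of key-based dependency checking: transactions sharing a state key are serialized into later batches, while key-disjoint transactions join the current batch and can execute in parallel. The greedy policy below is illustrative only and omits the paper's tree levels, dynamic priorities, and Starve-Avoid logic.

```python
# Transactions with the read/write key sets they touch (illustrative data).
transactions = [
    {"id": "T1", "keys": {"A", "B"}},
    {"id": "T2", "keys": {"A", "D"}},   # conflicts with T1 on key A
    {"id": "T3", "keys": {"C"}},
    {"id": "T4", "keys": {"D"}},        # conflicts with T2 on key D
]

batches = []
for tx in transactions:                  # arrival order preserved across batches
    if batches and not (tx["keys"] &
                        set().union(*(t["keys"] for t in batches[-1]))):
        batches[-1].append(tx)           # no shared key: safe to run in parallel
    else:
        batches.append([tx])             # conflict: serialize into a new batch

for i, batch in enumerate(batches):
    print(f"parallel batch {i}:", [t["id"] for t in batch])
# parallel batch 0: ['T1']; batch 1: ['T2', 'T3']; batch 2: ['T4']
```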
25 pages, 2388 KiB  
Article
Technology Innovation and Social and Behavioral Commitment: A Case Study of Digital Transformation in the Moroccan Insurance Industry
by Soukaina Abdallah-Ou-Moussa, Martin Wynn, Omar Kharbouch, Sara El Aoufi and Zakaria Rouaine
Big Data Cogn. Comput. 2025, 9(2), 31; https://doi.org/10.3390/bdcc9020031 - 5 Feb 2025
Viewed by 1310
Abstract
Digital transformation (DT) has become an imperative for companies seeking to evolve in a constantly changing industrial ecosystem, driven by the continual development and application of innovative digital technologies. Nevertheless, the success rate of DT initiatives remains surprisingly low, which only serves to highlight the need for a deeper understanding of the factors that determine the success of these initiatives. This study adopts a quantitative methodological approach to address this challenge, focusing on the Moroccan insurance industry. First, a systematic literature review was undertaken to identify the key change dimensions and related factors that influence DT acceptance, at both individual and corporate levels, as well as the potential risks associated with the adoption of DT. A survey of 100 employees of insurance companies in Morocco was then undertaken to statistically establish the key factors that determine the success of DT in these companies. The research results reveal that planned behavioral factors, as well as the innovative features of digital technologies, exert a positive influence on the attitude toward the acceptance of DT. Furthermore, this positivity translates into greater personal acceptance of new technologies within the Moroccan organizations studied. Although this paper focuses on one industry sector in one country, the authors believe the results make a valid contribution to both theory and practice. The findings indicate a clear distinction between individual acceptance of innovation and acceptance at a social level, an approach that has scarcely been addressed in previous research. It also offers valuable insights for leaders and organizational managers seeking to succeed in their DT projects by highlighting key determining factors to effectively guide this complex process.
24 pages, 1187 KiB  
Article
Ranking Influential Non-Content Factors on Scientific Papers’ Citation Impact: A Multidomain Comparative Analysis
by Jiannan Zhu, Jiayi Zhou, Jiaofeng Pan, Fu Gu and Jianfeng Guo
Big Data Cogn. Comput. 2025, 9(2), 30; https://doi.org/10.3390/bdcc9020030 - 5 Feb 2025
Viewed by 1352
Abstract
The influence of scientific papers is measured by their citations. Although predicting papers’ citation impact based on non-content factors has garnered extensive attention, the influence of such factors is rarely compared. In this article, we compare the influence of non-content factors on the citation counts of academic publications across three fields: math, computer science, and management. We consider different methods in this study, including three machine learning approaches, namely XGBoost, Gradient Boosting Decision Tree, and Random Forest, along with statistical techniques such as linear regression and quantile analysis. Our findings reveal that no matter the field or analytical method applied, author prestige and the number of references consistently stand out as the most influential factors, while the breadth of categories covered by a paper has minimal impact. In mathematics, the first citation date and article length are almost as important as author prestige, while the number of authors and the journal impact factor are crucial for computer science papers. In management, the number of collaborating countries is relatively influential with respect to a paper’s citations. The results of the quantile regression indicate that at higher quantile levels, the impacts of author prestige and the number of authors on papers’ citation impact are more pronounced across all three disciplines, while the journal impact factor and paper length have the greatest influence at low and medium quantile levels. Our findings indicate that the reliance of academic citations on author prestige and journal impact factors not only highlights the unequal distribution of resources within the current academic system but also further exacerbates citation inequality.
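For readers unfamiliar with quantile regression, a minimal statsmodels example on synthetic data; the variables are invented stand-ins for the study's non-content factors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
# Synthetic stand-ins for non-content factors (not the study's data)
df = pd.DataFrame({
    "author_prestige": rng.normal(size=n),
    "num_references":  rng.poisson(30, size=n),
})
df["citations"] = (5 + 3 * df.author_prestige + 0.2 * df.num_references
                   + rng.exponential(5, size=n))

# Fit the same model at several quantiles to see how effects shift
for q in (0.25, 0.5, 0.9):
    fit = smf.quantreg("citations ~ author_prestige + num_references", df).fit(q=q)
    print(f"q={q}: prestige coef = {fit.params['author_prestige']:.2f}")
```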
21 pages, 7163 KiB  
Article
VSA-GCNN: Attention Guided Graph Neural Networks for Brain Tumor Segmentation and Classification
by Kambham Pratap Joshi, Vishruth Boraiah Gowda, Parameshachari Bidare Divakarachari, Paramesh Siddappa Parameshwarappa and Raj Kumar Patra
Big Data Cogn. Comput. 2025, 9(2), 29; https://doi.org/10.3390/bdcc9020029 - 31 Jan 2025
Viewed by 1396
Abstract
For the past few decades, brain tumors have had a substantial influence on human life, and they pose severe health risks if not diagnosed and treated in the early stages. Brain tumors are highly diverse and vary extensively in size, type, and location, and this diversity makes it challenging to develop an accurate and reliable diagnostic tool. Several developments are still required to segment and classify the tumor region effectively and make an accurate diagnosis. Thus, the purpose of this research is to accurately segment and classify brain tumor Magnetic Resonance Images (MRI) to enhance diagnosis. Primarily, the images are collected from the BraTS 2019, 2020, and 2021 datasets and pre-processed using min–max normalization to eliminate noise. The pre-processed images then enter the segmentation stage, where a Variational Spatial Attention with Graph Convolutional Neural Network (VSA-GCNN) is applied to handle variations in tumor shape, size, and location. The segmented outputs are processed for feature extraction, where an AlexNet model is used to reduce dimensionality. Finally, in the classification stage, a Bidirectional Gated Recurrent Unit (Bi-GRU) is employed to classify brain tumor regions as gliomas or meningiomas. From the results, it is evident that the proposed VSA-GCNN-BiGRU shows superior results on the BraTS 2019 dataset in terms of accuracy (99.98%), sensitivity (99.92%), and specificity (99.91%) when compared with existing models. On the BraTS 2020 dataset, the proposed VSA-GCNN-BiGRU shows superior results in terms of Dice similarity coefficient (0.4), sensitivity (97.7%), accuracy (98.2%), and specificity (97.4%). When evaluated on the BraTS 2021 dataset, the proposed VSA-GCNN-BiGRU achieved a specificity of 97.6%, a Dice similarity of 98.6%, a sensitivity of 99.4%, and an accuracy of 99.8%. Overall, the proposed VSA-GCNN-BiGRU supports accurate brain tumor segmentation and classification, providing clinical significance in MRI when compared to existing models.
25 pages, 10920 KiB  
Article
Lightweight GAN-Assisted Class Imbalance Mitigation for Apple Flower Bud Detection
by Wenan Yuan and Peng Li
Big Data Cogn. Comput. 2025, 9(2), 28; https://doi.org/10.3390/bdcc9020028 - 29 Jan 2025
Viewed by 972
Abstract
Multi-class object detectors often suffer from the class imbalance issue, where substantial model performance discrepancies exist between classes. Generative adversarial networks (GANs), an emerging deep learning research topic, are able to learn from existing data distributions and generate similar synthetic data, which might serve as valid training data for improving object detectors. The current study investigated the utility of a lightweight unconditional GAN in addressing weak object detector class performance by incorporating synthetic data into real data for model retraining, in an agricultural context. AriAplBud, a multi-growth-stage aerial apple flower bud dataset, was deployed in the study. A baseline YOLO11n detector was first developed based on training, validation, and test datasets derived from AriAplBud. Six FastGAN models were developed based on dedicated subsets of the same YOLO training and validation datasets for different apple flower bud growth stages. Positive sample rates and average instance numbers per image of the synthetic data generated by each of the FastGAN models were investigated based on 1000 synthetic images and the baseline detector at various confidence thresholds. In total, 13 new YOLO11n detectors were retrained specifically for the two weak growth stages, tip and half-inch green, by including synthetic data pseudo-labeled by the baseline detector in the training datasets to increase the total instance numbers to 1000, 2000, 4000, and 8000, respectively. FastGAN showed its resilience in successfully generating positive samples, despite apple flower bud instances being generally small and randomly distributed in the images. Positive sample rates of the synthetic datasets, which ranged from 0 to 1, were negatively correlated with the detector confidence thresholds, as expected. Higher overall positive sample rates were observed for the growth stages with higher detector performance. The synthetic images generally contained fewer detector-detectable instances per image than the corresponding real training images. The best YOLO11n AP improvements achieved in the retrained detectors for tip and half-inch green were 30.13% and 14.02%, respectively, while the best YOLO11n mAP improvement was 2.83%. However, the relationship between synthetic training instance quantity and detector class performance has yet to be determined. GAN was concluded to be beneficial for retraining object detectors and improving their performance. Further studies are still needed to investigate the influence of synthetic training data quantity and quality on retrained object detector performance.
49 pages, 17199 KiB  
Article
Application of Symbolic Classifiers and Multi-Ensemble Threshold Techniques for Android Malware Detection
by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak
Big Data Cogn. Comput. 2025, 9(2), 27; https://doi.org/10.3390/bdcc9020027 - 29 Jan 2025
Viewed by 746
Abstract
Android malware detection using artificial intelligence is today a mandatory tool for preventing cyber attacks. To address this problem, the proposed methodology applies a genetic programming symbolic classifier (GPSC) to obtain symbolic expressions (SEs) that can detect whether an Android application is malware. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search (RHVS) method was applied, and the GPSC was trained using 5-fold cross-validation (5FCV). The initial, publicly available dataset is highly imbalanced; this problem was addressed by applying various preprocessing and oversampling techniques, creating a large number of balanced dataset variations, on each of which the GPSC was trained. Since the dataset has many input variables, three different approaches were considered: an initial investigation with all input variables, input variables with high feature importance, and the application of principal component analysis. After the SEs with the highest classification performance were obtained, they were used in threshold-based voting ensembles (TBVEs), whose threshold values were adjusted to improve classification performance. Multiple TBVEs were developed, yielding a robust system for Android malware detection with a peak accuracy of 0.98.
(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)
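A minimal sketch of a threshold-based voting ensemble: each symbolic expression's output is squashed to a probability, thresholded into a vote, and the votes are combined. The lambda expressions below stand in for GP-evolved SEs, and the threshold and vote count are illustrative, not the paper's tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-ins for GPSC symbolic expressions evaluated on a feature vector x;
# real SEs would be formulas evolved by genetic programming.
symbolic_expressions = [
    lambda x: 2.3 * x[0] - x[1] ** 2,
    lambda x: np.sin(x[2]) + 0.5 * x[0] * x[3],
    lambda x: x[1] - np.sqrt(abs(x[2])),
]

def tbve_predict(x, threshold=0.5, min_votes=2):
    """Threshold-based voting ensemble: squash each SE output to a
    probability, threshold it into a vote, and take the majority."""
    votes = sum(sigmoid(se(x)) > threshold for se in symbolic_expressions)
    return int(votes >= min_votes)       # 1 = malware, 0 = benign

x = np.array([0.7, -1.2, 0.3, 2.0])
print(tbve_predict(x))
```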
41 pages, 5100 KiB  
Article
Leveraging Open Big Data from R&D Projects with Large Language Models
by Desireé Ruiz, Yudith Cardinale, Abraham Casas and Vanessa Moscardó
Big Data Cogn. Comput. 2025, 9(2), 26; https://doi.org/10.3390/bdcc9020026 - 28 Jan 2025
Viewed by 901
Abstract
Recent studies have highlighted the potential of Large Language Models (LLMs) to become experts in specific areas of knowledge through the utilization of techniques that enhance their context. Nevertheless, an interesting and underexplored application in the literature is the creation of an LLM that specializes in research projects, as it could streamline the process of project ideation and accelerate the advancement of research initiatives. In this regard, the aim of this work is to develop a tool based on LLM technology capable of assisting the employees of technology centers in answering their queries related to research projects funded under the Horizon 2020 program. By facilitating the identification of suitable funding calls and the formation of consortia with partners meeting specific requirements, tasks that are traditionally time-intensive, the proposed tool has the potential to improve operational efficiency and enable technology centers to allocate their resources more effectively. To improve the model’s baseline performance, context extension techniques such as Retrieval-Augmented Generation (RAG) and prompt engineering were explored. Specifically, different RAG approaches and configurations, along with a specialized prompt, were tested on the LLaMA 3 70B model, and their results were compared to those obtained without context extension. The proposed evaluation metrics, which aligned with human judgment while maintaining objectivity, revealed that RAG systems outperformed the standalone LLaMA 3 70B, achieving a rate of optimal responses of up to 46% compared to 0% for the baseline model. These findings emphasize that integrating RAG and prompt engineering pipelines into LLMs can address key limitations, such as generating accurate and informative answers. Moreover, this study demonstrates the practical feasibility of leveraging advanced LLM configurations to support research-driven organizations, highlighting a pathway for the further development of intelligent tools that enhance productivity and foster innovation in the research domain.
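A bare-bones sketch of the retrieve-then-prompt loop underlying RAG. The hash-seeded embed() is a placeholder with no semantic meaning (deterministic only within a run), and the corpus strings are invented; a real pipeline would use a sentence-embedding model and the study's H2020 project documents.

```python
import numpy as np

# Toy corpus of project summaries (invented, for structure only)
corpus = ["Project A: hydrogen storage consortium, call H2020-LC-SC3 ...",
          "Project B: federated learning for medical imaging ...",
          "Project C: smart-grid cybersecurity pilot ..."]

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hash-seeded random unit vector, no semantics."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

doc_vectors = np.stack([embed(d) for d in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)    # cosine similarity on unit vectors
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "Which calls fund energy-storage consortia?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to the LLM (LLaMA 3 70B in the study).
```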
17 pages, 4219 KiB  
Article
Optimizing Convolutional Neural Network Architectures with Optimal Activation Functions for Pediatric Pneumonia Diagnosis Using Chest X-Rays
by Petra Radočaj, Dorijan Radočaj and Goran Martinović
Big Data Cogn. Comput. 2025, 9(2), 25; https://doi.org/10.3390/bdcc9020025 - 27 Jan 2025
Cited by 3 | Viewed by 1186
Abstract
Pneumonia remains a significant cause of morbidity and mortality among pediatric patients worldwide. Accurate and timely diagnosis is crucial for effective treatment and improved patient outcomes. Traditionally, pneumonia diagnosis has relied on a combination of clinical evaluation and radiologists’ interpretation of chest X-rays. However, this process is time-consuming and prone to inconsistencies in diagnosis. The integration of advanced technologies such as Convolutional Neural Networks (CNNs) into medical diagnostics offers the potential to enhance accuracy and efficiency. In this study, we conduct a comprehensive evaluation of various activation functions within CNNs for pediatric pneumonia classification using a dataset of 5856 chest X-ray images. The novel Mish activation function was compared with Swish and ReLU, demonstrating superior performance in terms of accuracy, precision, recall, and F1-score in all cases. Notably, InceptionResNetV2 combined with the Mish activation function achieved the highest overall performance, with an accuracy of 97.61%. Although the dataset used may not fully represent the diversity of real-world clinical cases, this research provides valuable insights into the influence of activation functions on CNN performance in medical image analysis, laying a foundation for future automated pneumonia diagnostic systems.
(This article belongs to the Topic Applied Computing and Machine Intelligence (ACMI))
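The three activation functions compared above have standard closed forms: ReLU(x) = max(0, x), Swish(x) = x·sigmoid(x), and Mish(x) = x·tanh(softplus(x)). A minimal NumPy sketch of these definitions follows; the paper plugged them into CNN backbones such as InceptionResNetV2 rather than using them standalone.

```python
# Standard definitions of the three activation functions compared in the paper.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def swish(x):
    # Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def mish(x):
    # Mish: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-4, 4, 9)
print(relu(x), swish(x), mish(x), sep="\n")
```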
31 pages, 634 KiB  
Article
BankNet: Real-Time Big Data Analytics for Secure Internet Banking
by Kaushik Sathupadi, Sandesh Achar, Shinoy Vengaramkode Bhaskaran, Nuruzzaman Faruqui and Jia Uddin
Big Data Cogn. Comput. 2025, 9(2), 24; https://doi.org/10.3390/bdcc9020024 - 26 Jan 2025
Cited by 2 | Viewed by 3345
Abstract
The rapid growth of Internet banking has necessitated advanced systems for secure, real-time decision making. This paper introduces BankNet, a predictive analytics framework integrating big data tools and a BiLSTM neural network to deliver high-accuracy transaction analysis. BankNet achieves exceptional predictive performance, with a Root Mean Squared Error of 0.0159 and fraud detection accuracy of 98.5%, while efficiently handling data rates up to 1000 Mbps with minimal latency. By addressing critical challenges in fraud detection and operational efficiency, BankNet establishes itself as a robust decision support system for modern Internet banking. Its scalability and precision make it a transformative tool for enhancing security and trust in financial services. Full article
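As an illustration of the model family BankNet builds on, here is a minimal PyTorch sketch of a BiLSTM classifier over transaction sequences. The feature width, hidden size, and sigmoid fraud-score head are hypothetical simplifications; the paper's big-data ingestion layer (handling rates up to 1000 Mbps) is not reproduced here.

```python
# Minimal BiLSTM sequence classifier sketch (hypothetical sizes, not BankNet itself).
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # forward + backward states -> fraud score

    def forward(self, x):                          # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))  # score from the last time step

model = BiLSTMClassifier()
fake_batch = torch.randn(4, 20, 8)                 # 4 sequences of 20 transactions each
print(model(fake_batch).shape)                     # torch.Size([4, 1])
```

The bidirectional encoding lets the score at the final step reflect both past and future context within the window, which is the usual motivation for BiLSTMs over plain LSTMs in transaction analysis.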
21 pages, 806 KiB  
Article
Labeling Network Intrusion Detection System (NIDS) Rules with MITRE ATT&CK Techniques: Machine Learning vs. Large Language Models
by Nir Daniel, Florian Klaus Kaiser, Shay Giladi, Sapir Sharabi, Raz Moyal, Shalev Shpolyansky, Andres Murillo, Aviad Elyashar and Rami Puzis
Big Data Cogn. Comput. 2025, 9(2), 23; https://doi.org/10.3390/bdcc9020023 - 26 Jan 2025
Viewed by 1428
Abstract
Analysts in Security Operations Centers (SOCs) are often occupied with time-consuming investigations of alerts from Network Intrusion Detection Systems (NIDSs). Many NIDS rules lack clear explanations and associations with attack techniques, complicating alert triage and the generation of attack hypotheses. Large Language Models (LLMs) may be a promising technology to reduce the alert explainability gap by associating rules with attack techniques. In this paper, we investigate the ability of three prominent LLMs (ChatGPT, Claude, and Gemini) to reason about NIDS rules while labeling them with MITRE ATT&CK tactics and techniques. We discuss prompt design and present experiments performed with 973 Snort rules. Our results indicate that while LLMs provide explainable, scalable, and efficient initial mappings, traditional machine learning (ML) models consistently outperform them in accuracy, achieving higher precision, recall, and F1-scores. These results highlight the potential for hybrid LLM-ML approaches to enhance SOC operations and better address the evolving threat landscape. Through automation, the presented methods can enhance the efficiency of SOC alert analysis and decrease analyst workloads. Full article
(This article belongs to the Special Issue Generative AI and Large Language Models)
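The core of the LLM-based labeling task is the prompt: give the model a NIDS rule and constrain it to answer with ATT&CK tactic and technique labels. The sketch below shows that structure; the Snort rule and the prompt wording are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical prompt template for labeling a Snort rule with MITRE ATT&CK.
SNORT_RULE = (
    'alert tcp any any -> $HOME_NET 22 '
    '(msg:"Possible SSH brute force"; flags:S; threshold:type both, '
    'track by_src, count 5, seconds 60; sid:1000001;)'
)

PROMPT_TEMPLATE = """You are a SOC analyst. Map the following Snort rule to
MITRE ATT&CK. Answer with one tactic and one technique ID (e.g., T1110).

Rule:
{rule}

Answer:"""

print(PROMPT_TEMPLATE.format(rule=SNORT_RULE))
```

For this example rule, a correct mapping would be the Credential Access tactic with technique T1110 (Brute Force); comparing such LLM outputs against ground-truth labels is what the paper's ML-vs-LLM evaluation measures.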
17 pages, 4119 KiB  
Article
Evaluating the Effect of Surrogate Data Generation on Healthcare Data Assessment
by Saeid Sanei, Tracey K. M. Lee, Issam Boukhennoufa, Delaram Jarchi, Xiaojun Zhai and Klaus McDonald-Maier
Big Data Cogn. Comput. 2025, 9(2), 22; https://doi.org/10.3390/bdcc9020022 - 26 Jan 2025
Viewed by 662
Abstract
In healthcare applications, it is often not possible to record the amount of data required for deep learning or other data-driven classification and feature detection systems, owing to the patient’s condition, clinical or experimental limitations, or time constraints. Moreover, data imbalance invalidates many of the test results crucial for clinical approval. Generating synthetic (artificial or dummy) data has become a potential solution to this problem. Such data should possess sufficient information, properties, and characteristics to mimic real-world data recorded under natural circumstances. Several methods have been proposed for this purpose, and results often show that adding surrogates improves decision-making accuracy. This article evaluates the most recent surrogate data generation and data synthesis methods to investigate how the number of surrogates affects classification results. It is shown that classification results improve as the number of surrogates increases, but the improvement plateaus beyond a certain number. This finding helps in deciding how many surrogates to generate for each strategy, thereby reducing computation cost. Full article
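One classical surrogate-generation method for time series, which may help make the idea concrete, is phase randomization: keep the signal's amplitude spectrum but scramble its Fourier phases, so each surrogate shares the original's linear (spectral) properties. This sketch is a generic illustration under that assumption, not the specific methods benchmarked in the paper, and the input signal is synthetic.

```python
# Phase-randomized (Fourier) surrogate sketch for a 1-D signal.
import numpy as np

def phase_randomized_surrogate(x: np.ndarray, rng) -> np.ndarray:
    """Return a surrogate with the same power spectrum but randomized phases."""
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0, 2 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                                 # keep the DC component real
    surrogate = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate, n=len(x))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(256)
surrogates = [phase_randomized_surrogate(signal, rng) for _ in range(10)]
print(len(surrogates), surrogates[0][:5])
```

Repeating the call with different random phases yields as many surrogates as desired, which is exactly the knob, the number of surrogates, whose effect the paper studies.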
30 pages, 6147 KiB  
Article
Long-Term Forecasting of Solar Irradiation in Riyadh, Saudi Arabia, Using Machine Learning Techniques
by Khalil AlSharabi, Yasser Bin Salamah, Majid Aljalal, Akram M. Abdurraqeeb and Fahd A. Alturki
Big Data Cogn. Comput. 2025, 9(2), 21; https://doi.org/10.3390/bdcc9020021 - 25 Jan 2025
Cited by 1 | Viewed by 1356
Abstract
Time series forecasting is challenging because such data are complex in nature and therefore difficult to forecast accurately. This study presents the design and development of a novel forecasting system that integrates efficient data processing techniques with advanced machine learning algorithms to improve time series forecasting in the sustainability domain, focusing specifically on solar irradiation forecasting in Riyadh, Saudi Arabia. Efficient and accurate forecasts of solar irradiation are important for optimizing power production and its smooth integration into the utility grid. This advancement supports Saudi Arabia’s Vision 2030, which aims to generate and utilize renewable energy sources to drive sustainable development. The proposed forecasting system was therefore tailored to the environmental characteristics of the Riyadh region, including high solar intensity, dust storms, and unpredictable weather conditions. After cleaning and filtering, the dataset was pre-processed using standardization, and the Discrete Wavelet Transform (DWT) was applied to extract features from the pre-processed data. The extracted features were then split into three subsets: training, testing, and forecasting. Finally, two machine learning techniques were used for forecasting: Support Vector Machine (SVM) and Gaussian Process (GP) techniques. The proposed system was evaluated across different time horizons: one-day, five-day, ten-day, and fifteen-day ahead. Comprehensive evaluation metrics were calculated, including accuracy, stability, and generalizability measures. The results show that the proposed system provides a robust and adaptable solution for long-term forecasting of the complex solar irradiation patterns in Riyadh, Saudi Arabia. Full article
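The pipeline shape described above (standardize, extract DWT features, fit a kernel regressor) can be sketched in a few lines. The random series standing in for the irradiation data, the `db4` wavelet, and the window/horizon sizes are assumptions for illustration; the paper's actual configuration and Gaussian Process variant are not reproduced here.

```python
# Sketch of a standardize -> DWT features -> SVM regression forecasting pipeline.
import numpy as np
import pywt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
series = rng.random(500)                             # placeholder irradiation series

def dwt_features(window: np.ndarray) -> np.ndarray:
    """Concatenate approximation and detail coefficients of a 1-level DWT."""
    cA, cD = pywt.dwt(window, "db4")
    return np.concatenate([cA, cD])

W, H = 32, 1                                         # window length, forecast horizon
idx = range(len(series) - W - H + 1)
X = np.array([dwt_features(series[i:i + W]) for i in idx])
y = np.array([series[i + W + H - 1] for i in idx])   # value H steps past each window

X = StandardScaler().fit_transform(X)
model = SVR().fit(X[:-50], y[:-50])                  # hold out the last 50 points
print(model.score(X[-50:], y[-50:]))                 # R^2 (meaningless on random data)
```

Longer horizons (e.g., five or fifteen days ahead) correspond to larger `H`; the GP variant would swap `SVR()` for `sklearn.gaussian_process.GaussianProcessRegressor()`.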
15 pages, 1119 KiB  
Article
Fit Talks: Forecasting Fitness Awareness in Saudi Arabia Using Fine-Tuned Transformers
by Nora Alturayeif, Deemah Alqahtani, Sumayh S. Aljameel, Najla Almajed, Lama Alshehri, Nourah Aldhuwaihi, Madawi Alhadyan and Nouf Aldakheel
Big Data Cogn. Comput. 2025, 9(2), 20; https://doi.org/10.3390/bdcc9020020 - 23 Jan 2025
Viewed by 1043
Abstract
Understanding public sentiment on health and fitness is essential for addressing regional health challenges in Saudi Arabia. This research employs sentiment analysis to assess fitness awareness by analyzing content from the X platform (formerly Twitter), using a dataset called Saudi Aware, which includes 3593 posts related to fitness awareness. Preprocessing steps such as normalization, stop-word removal, and tokenization ensured high-quality data. The findings revealed that positive sentiments about fitness and health were more prevalent than negative ones, with posts across all sentiment categories being most common in the western region. However, the eastern region exhibited the highest percentage of positive sentiment, indicating a strong interest in fitness and health. For sentiment classification, we fine-tuned two transformer architectures—BERT and GPT—utilizing three BERT-based models (AraBERT, MARBERT, CAMeLBERT) and GPT-3.5. These findings provide valuable insights into Saudi Arabian attitudes toward fitness and health, offering actionable information for public health campaigns and initiatives. Full article
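Fine-tuning a BERT-based model for three-way sentiment, as described above, follows a standard Hugging Face pattern. In this sketch, "aubmindlab/bert-base-arabertv02" is the public AraBERT checkpoint; the two placeholder posts, the label scheme, and the training settings are assumptions, not the paper's setup.

```python
# Minimal sentiment fine-tuning sketch with a BERT-based Arabic model.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["...", "..."]   # Arabic fitness posts would go here (placeholders)
labels = [2, 0]          # hypothetical scheme: 0 = negative, 1 = neutral, 2 = positive

tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv02", num_labels=3)

enc = tok(texts, truncation=True, padding=True, return_tensors="pt")

class PostDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=PostDataset(),
)
trainer.train()
```

Swapping the checkpoint name is all that distinguishes the AraBERT, MARBERT, and CAMeLBERT runs in this pattern; GPT-3.5 fine-tuning goes through a hosted API instead.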
21 pages, 7488 KiB  
Article
Low-Cost Embedded System Applications for Smart Cities
by Victoria Alejandra Salazar Herrera, Hugo Puertas de Araújo, César Giacomini Penteado, Mario Gazziro and João Paulo Carmo
Big Data Cogn. Comput. 2025, 9(2), 19; https://doi.org/10.3390/bdcc9020019 - 22 Jan 2025
Viewed by 1413
Abstract
The Internet of Things (IoT) represents a transformative technology that allows interconnected devices to exchange data over the Internet, enabling automation and real-time decision making in a variety of areas. A key aspect of the success of the IoT lies in its integration with low-resource hardware, such as low-cost microprocessors and microcontrollers. These devices, which are affordable and energy efficient, are capable of handling basic tasks such as sensing, processing, and data transmission. Their low cost makes them ideal for IoT applications in low-income communities where the government is often absent. This review presents several applications, including a flood detection system, a monitoring system for analog and digital sensors, an air quality measurement system, a mesh video network for community surveillance, and a real-time fleet management system, that use low-cost hardware such as the ESP32, Raspberry Pi, and Arduino together with the MQTT protocol to implement affordable monitoring systems that improve the quality of life of people in small cities and communities. Full article
(This article belongs to the Special Issue Application of Cloud Computing in Industrial Internet of Things)
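The publish side of such an MQTT-based monitoring node is only a few lines. This sketch uses paho-mqtt on a Pi-class device; the broker address, topic, and sensor reading are hypothetical, and on an ESP32 the same role is typically played by a MicroPython or Arduino MQTT client instead.

```python
# Hypothetical MQTT publisher for a low-cost flood-level monitoring node.
import json
import time
import paho.mqtt.client as mqtt

# paho-mqtt 1.x constructor; paho-mqtt 2.x additionally requires
# mqtt.CallbackAPIVersion.VERSION2 as the first argument.
client = mqtt.Client()
client.connect("broker.example.org", 1883)           # hypothetical broker
client.loop_start()                                   # background network loop

while True:
    reading = {"node": "flood-sensor-01", "water_level_cm": 12.4}  # stub value
    client.publish("city/flood/level", json.dumps(reading), qos=1)
    time.sleep(60)                                    # one reading per minute
```

A dashboard or alerting service subscribes to the same topic on the broker, which is what keeps the per-node cost low: the nodes need only intermittent, lightweight publishes.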