Journal Description
Big Data and Cognitive Computing
Big Data and Cognitive Computing is an international, peer-reviewed, open access journal on big data and cognitive computing, published monthly online by MDPI.
- Open Access: free for readers, with article processing charges (APCs) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), dblp, Inspec, Ei Compendex, and other databases.
- Journal Rank: JCR - Q1 (Computer Science, Theory and Methods) / CiteScore - Q1 (Computer Science Applications)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 25.3 days after submission; accepted papers are published within 5.6 days (median values for papers published in this journal in the second half of 2024).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 3.7 (2023)
Latest Articles
Image Visual Quality: Sharpness Evaluation in the Logarithmic Image Processing Framework
Big Data Cogn. Comput. 2025, 9(6), 154; https://doi.org/10.3390/bdcc9060154 - 9 Jun 2025
Abstract
In image processing, the acquisition step plays a fundamental role because it determines image quality. The present paper focuses on the issue of blur and suggests ways of assessing contrast. The aim of this work is to evaluate the sharpness of an image by means of objective measures based on mathematical, physical, and optical justifications connected with the human visual system. This is why the Logarithmic Image Processing (LIP) framework was chosen. The sharpness of an image is usually assessed near objects’ boundaries, which encourages the use of gradients, despite some major drawbacks. Within the LIP framework, it is possible to overcome such problems using a “contour detector” tool based on the notion of Logarithmic Additive Contrast (LAC). Considering a sequence of increasingly blurred images, we show that the use of LAC enables the images to be re-classified in accordance with their defocus level, demonstrating the relevance of the method. The proposed algorithm has been shown to outperform five conventional methods for assessing image sharpness. Moreover, it is the only method that is insensitive to brightness variations. Finally, various application examples are presented, such as automatic autofocus control or the comparison of two blur-removal algorithms applied to the same image, which particularly concerns the field of Super Resolution (SR) algorithms. Such algorithms multiply (×2, ×3, ×4) the resolution of an image using powerful tools (deep learning, neural networks) while correcting the potential defects (blur, noise) that could be generated by the resolution extension itself. We conclude with the prospects for this work, which should be part of a broader approach to estimating image quality, including sharpness and perceived contrast.
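The LIP operations underlying this approach are standard (the Jourlin–Pinoli model); the sketch below illustrates them with numpy, assuming the usual gray-tone bound M = 256. The sharpness proxy shown (largest adjacent-pixel LIP difference) is a simplified stand-in for the paper's LAC-based detector, not its actual definition.

```python
import numpy as np

M = 256.0  # upper bound of the gray-tone range in the LIP model

def lip_add(f, g):
    """LIP addition: f (+) g = f + g - f*g/M."""
    return f + g - f * g / M

def lip_sub(f, g):
    """LIP subtraction: f (-) g = M*(f - g)/(M - g), defined for g < M."""
    return M * (f - g) / (M - g)

def lip_sharpness(img):
    """Toy focus measure: the largest absolute LIP difference between
    horizontally adjacent pixels (a crude stand-in for the paper's LAC)."""
    img = img.astype(np.float64)
    d = lip_sub(img[:, 1:], img[:, :-1])
    return float(np.max(np.abs(d)))

# A sharp step edge yields a larger maximal LIP gradient than a blurred ramp:
sharp = np.tile([0, 0, 0, 200, 200, 200], (6, 1))
blurred = np.tile([0, 40, 80, 120, 160, 200], (6, 1))
print(lip_sharpness(sharp), lip_sharpness(blurred))
```

Re-ranking a stack of captures by this score is the essence of the defocus-classification experiment described above.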
Open Access Article
Real-Time Algal Monitoring Using Novel Machine Learning Approaches
by Seyit Uguz, Yavuz Selim Sahin, Pradeep Kumar, Xufei Yang and Gary Anderson
Big Data Cogn. Comput. 2025, 9(6), 153; https://doi.org/10.3390/bdcc9060153 - 9 Jun 2025
Abstract
Monitoring algal growth rates and estimating microalgae concentration in photobioreactor systems are critical for optimizing production efficiency. Traditional methods—such as microscopy, fluorescence, flow cytometry, spectroscopy, and macroscopic approaches—while accurate, are often costly, time-consuming, labor-intensive, and susceptible to contamination or production interference. To overcome these limitations, this study proposes an automated, real-time, and cost-effective solution by integrating machine learning with image-based analysis. We evaluated the performance of Decision Trees (DTS), Random Forests (RF), Gradient Boosting Machines (GBM), and K-Nearest Neighbors (k-NN) algorithms using RGB color histograms extracted from images of Scenedesmus dimorphus cultures. Ground truth data were obtained via manual cell enumeration under a microscope and dry biomass measurements. Among the models tested, DTS achieved the highest accuracy for cell count prediction (R2 = 0.77), while RF demonstrated superior performance for dry biomass estimation (R2 = 0.66). Compared to conventional methods, the proposed ML-based approach offers a low-cost, non-invasive, and scalable alternative that significantly reduces manual effort and response time. These findings highlight the potential of machine learning–driven imaging systems for continuous, real-time monitoring in industrial-scale microalgae cultivation.
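The feature pipeline described (per-channel RGB histograms feeding a tree-based regressor) can be sketched with scikit-learn; the synthetic image generator, bin count, and model settings below are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def rgb_histogram_features(image, bins=16):
    """Concatenate per-channel intensity histograms into one feature vector."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256),
                          density=True)[0] for c in range(3)]
    return np.concatenate(feats)

rng = np.random.default_rng(0)

def fake_culture_image(density):
    """Synthetic stand-in for a culture image: denser cultures look greener."""
    img = rng.integers(0, 80, size=(32, 32, 3))
    img[..., 1] = np.clip(img[..., 1] + int(150 * density), 0, 255)
    return img

densities = rng.uniform(0.1, 1.0, size=200)
X = np.stack([rgb_histogram_features(fake_culture_image(d)) for d in densities])
y = densities * 1e6  # pretend cells/mL from manual enumeration

model = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)
print(model.score(X, y))  # training R^2
```

In the study, the targets would instead come from microscope counts or dry-biomass measurements of Scenedesmus dimorphus cultures.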
Open Access Article
Real-Time Image Semantic Segmentation Based on Improved DeepLabv3+ Network
by Peibo Li, Jiangwu Zhou and Xiaohua Xu
Big Data Cogn. Comput. 2025, 9(6), 152; https://doi.org/10.3390/bdcc9060152 - 6 Jun 2025
Abstract
To improve the performance of image semantic segmentation algorithms and achieve a better balance between accuracy and real-time performance, this paper proposes a real-time image semantic segmentation model based on an improved DeepLabv3+ network. First, the MobileNetV2 model, with its low computational overhead and small number of parameters, is selected as the backbone network to improve segmentation speed. Then, a Feature Enhancement Module (FEM) is applied to several shallow features of different scales in MobileNetV2, and these shallow features are fused to improve the encoder's use of edge information, retain more detailed information, and strengthen the network's feature representation of complex scenes. Finally, to address the insufficient attention to detail in the merged output feature maps of the Atrous Spatial Pyramid Pooling (ASPP) module, the FEM attention mechanism is also applied to the feature maps processed by the ASPP module. The algorithm in this study achieves 76.45% mean intersection over union (mIoU) accuracy at 29.18 FPS on the PASCAL VOC2012 Augmented dataset, and 37.31% mIoU accuracy at 23.31 FPS on the ADE20K dataset. The experimental results show that the algorithm achieves a good balance between accuracy and real-time performance, and its image semantic segmentation performance is significantly improved compared to DeepLabv3+ and other existing algorithms.
Open Access Review
The Use of Large Language Models in Ophthalmology: A Scoping Review on Current Use-Cases and Considerations for Future Works in This Field
by See Ye King Clarence, Lim Khai Shin Alva, Au Wei Yung, Chia Si Yin Charlene, Fan Xiuyi and Li Zhenghao Kelvin
Big Data Cogn. Comput. 2025, 9(6), 151; https://doi.org/10.3390/bdcc9060151 - 6 Jun 2025
Abstract
The advancement of generative artificial intelligence (AI) has resulted in its use permeating many areas of life. Amidst this eruption of scientific output, a wide range of research regarding the usage of Large Language Models (LLMs) in ophthalmology has emerged. In this study, we aim to map out the landscape of LLM applications in ophthalmology and, by consolidating the work carried out, to produce a point of reference to guide the conduct of future works. Eight databases were searched for articles from 2019 to 2024. In total, 976 studies were screened, and a final 49 were included. The study designs and outcomes of these studies were analysed. The performance of LLMs was further analysed in the areas of exam taking and patient education, diagnostic capability, management capability, administration, inaccuracies, and harm. LLMs performed acceptably in most studies, even surpassing humans in some. Despite their relatively good performance, issues pertaining to study design, grading protocols, hallucinations, inaccuracies, and harm were found to be pervasive. LLMs have received considerable attention since their introduction to the public and have found potential applications in the field of medicine, and in particular, ophthalmology. However, this review recommends using standardised evaluation frameworks and addressing gaps in the current literature when applying LLMs in ophthalmology.
Open Access Article
A Framework for Rapidly Prototyping Data Mining Pipelines
by Flavio Corradini, Luca Mozzoni, Marco Piangerelli, Barbara Re and Lorenzo Rossi
Big Data Cogn. Comput. 2025, 9(6), 150; https://doi.org/10.3390/bdcc9060150 - 5 Jun 2025
Abstract
With the advent of Big Data, data mining techniques have become crucial for improving decision-making across diverse sectors, yet their employment demands significant resources and time. Time is critical in industrial contexts, as delays can lead to increased costs, missed opportunities, and reduced competitive advantage. To address this, systems for analyzing data can help prototype data mining pipelines, mitigating the risks of failure and resource wastage, especially when experimenting with novel techniques. Moreover, business experts often lack deep technical expertise and need robust support to validate their pipeline designs quickly. This paper presents Rainfall, a novel framework for rapidly prototyping data mining pipelines, developed through collaborative projects with industry. The framework’s requirements stem from a combination of literature review findings, iterative industry engagement, and analysis of existing tools. Rainfall enables the visual programming, execution, monitoring, and management of data mining pipelines, lowering the barrier for non-technical users. Pipelines are composed of configurable nodes that encapsulate functionalities from popular libraries or custom user-defined code, fostering experimentation. The framework is evaluated through a case study and SWOT analysis with INGKA, a large-scale industry partner, alongside usability testing with real users and validation against scenarios from the literature. The paper then underscores the value of industry–academia collaboration in bridging theoretical innovation with practical application.
Open Access Article
Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
by Grant Wardle and Teo Sušnjak
Big Data Cogn. Comput. 2025, 9(6), 149; https://doi.org/10.3390/bdcc9060149 - 3 Jun 2025
Abstract
Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts.
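The variable under study (modality order within a multi-modal prompt) can be made concrete with an OpenAI-style content-part message; the helper below is a hypothetical sketch showing how to swap image-first and text-first orderings, not code from the study, and vendor payload formats differ in detail.

```python
def build_prompt(question, image_url, image_first=True):
    """Assemble one multi-modal user message, controlling whether the image
    part precedes or follows the text part (OpenAI-style content parts)."""
    text_part = {"type": "text", "text": question}
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    parts = [image_part, text_part] if image_first else [text_part, image_part]
    return [{"role": "user", "content": parts}]

# For a single-image task, try both orders and keep whichever scores better
# on a validation set:
msg_img_first = build_prompt("What fault does the gauge show?",
                             "https://example.com/gauge.png")
msg_txt_first = build_prompt("What fault does the gauge show?",
                             "https://example.com/gauge.png", image_first=False)
print(msg_img_first[0]["content"][0]["type"])  # image_url
```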
Open Access Article
Improving Early Detection of Dementia: Extra Trees-Based Classification Model Using Inter-Relation-Based Features and K-Means Synthetic Minority Oversampling Technique
by Yanawut Chaiyo, Worasak Rueangsirarak, Georgi Hristov and Punnarumol Temdee
Big Data Cogn. Comput. 2025, 9(6), 148; https://doi.org/10.3390/bdcc9060148 - 30 May 2025
Abstract
The early detection of dementia, a condition affecting both individuals and society, is essential for its effective management. However, reliance on advanced laboratory tests and specialized expertise limits accessibility, hindering timely diagnosis. To address this challenge, this study proposes a novel approach in which readily available biochemical and physiological features from electronic health records are employed to develop a machine learning-based binary classification model, improving accessibility and early detection. A dataset of 14,763 records from Phachanukroh Hospital, Chiang Rai, Thailand, was used for model construction. The use of a hybrid data enrichment framework involving feature augmentation and data balancing was proposed in order to increase the dimensionality of the data. Medical domain knowledge was used to generate inter-relation-based features (IRFs), which improve data diversity and promote explainability by making the features more informative. For data balancing, the K-Means Synthetic Minority Oversampling Technique (K-Means SMOTE) was applied to generate synthetic samples in under-represented regions of the feature space, addressing class imbalance. Extra Trees (ET) was used for model construction due to its noise resilience and ability to manage multicollinearity. The performance of the proposed method was compared with that of Support Vector Machine, K-Nearest Neighbors, Artificial Neural Networks, Random Forest, and Gradient Boosting. The results reveal that the ET model significantly outperformed other models on the combined dataset with four IRFs and K-Means SMOTE across key metrics, including accuracy (96.47%), precision (94.79%), recall (97.86%), F1 score (96.30%), and area under the receiver operating characteristic curve (99.51%).
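The balancing-plus-classification step can be approximated with plain scikit-learn: cluster the minority class with K-Means and interpolate between same-cluster points, which is the core idea of K-Means SMOTE (a full implementation such as imbalanced-learn's KMeansSMOTE would normally be used; the toy data and settings below are illustrative, not the paper's).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(42)

def kmeans_smote_sketch(X_min, n_new, n_clusters=3):
    """Generate synthetic minority samples by interpolating between points
    that fall in the same K-Means cluster (the core idea of K-Means SMOTE)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_min)
    synthetic = []
    while len(synthetic) < n_new:
        c = rng.integers(n_clusters)
        members = X_min[labels == c]
        if len(members) < 2:
            continue  # cannot interpolate inside a singleton cluster
        a, b = members[rng.choice(len(members), 2, replace=False)]
        synthetic.append(a + rng.uniform() * (b - a))
    return np.array(synthetic)

# Imbalanced toy data: 300 majority vs. 40 minority samples in 5 features.
X_maj = rng.normal(0.0, 1.0, size=(300, 5))
X_min = rng.normal(2.0, 1.0, size=(40, 5))
X_syn = kmeans_smote_sketch(X_min, n_new=260)

X = np.vstack([X_maj, X_min, X_syn])
y = np.array([0] * 300 + [1] * (40 + 260))
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Clustering first concentrates the synthetic samples in genuinely minority regions of the feature space, which is what the abstract credits for addressing class imbalance.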
Open Access Review
The Importance of AI Data Governance in Large Language Models
by Saurabh Pahune, Zahid Akhtar, Venkatesh Mandapati and Kamran Siddique
Big Data Cogn. Comput. 2025, 9(6), 147; https://doi.org/10.3390/bdcc9060147 - 28 May 2025
Abstract
AI data governance is a crucial framework for ensuring that data are properly utilized throughout the lifecycle of large language model (LLM) activity, from development to end-to-end testing, model validation, secure deployment, and operations. This requires data to be managed responsibly, confidentially, securely, and ethically. The main objective of data governance is to implement a robust and intelligent governance framework for LLMs, one that shapes data quality management, model fine-tuning and performance, bias, data privacy law compliance, security protocols, ethical AI practices, and regulatory compliance processes. Effective data governance is important for minimizing data breaches, enhancing data security, ensuring regulatory compliance, mitigating bias, and establishing clear policies and guidelines. This paper covers the foundations of AI data governance, its key components, types of data governance, best practices, case studies, challenges, and future directions for data governance in LLMs. Additionally, we conduct a comprehensive analysis of how effectively AI data governance must be integrated for LLMs to earn end-user trust. Finally, we provide deeper insights into the relevance of data governance frameworks to the current landscape of LLMs in the healthcare, pharmaceutical, finance, supply chain management, and cybersecurity sectors, and discuss their essential roles, effectiveness, and limitations.
Open Access Article
Bone Segmentation in Low-Field Knee MRI Using a Three-Dimensional Convolutional Neural Network
by Ciro Listone, Diego Romano and Marco Lapegna
Big Data Cogn. Comput. 2025, 9(6), 146; https://doi.org/10.3390/bdcc9060146 - 28 May 2025
Abstract
Bone segmentation in magnetic resonance imaging (MRI) is crucial for clinical and research applications, including diagnosis, surgical planning, and treatment monitoring. However, it remains challenging due to anatomical variability and complex bone morphology. Manual segmentation is time-consuming and operator-dependent, fostering interest in automated methods. This study proposes an automated segmentation method based on a 3D U-Net convolutional neural network to segment the femur, tibia, and patella from low-field MRI scans. Low-field MRI offers advantages in cost, patient comfort, and accessibility but presents challenges related to lower signal quality. Our method achieved a Dice Similarity Coefficient (DSC) of 0.9838, Intersection over Union (IoU) of 0.9682, and Average Hausdorff Distance (AHD) of 0.0223, with an inference time of approximately 3.96 s per volume on a GPU. Although post-processing had minimal impact on metrics, it significantly enhanced the visual smoothness of bone surfaces, which is crucial for clinical use. The final segmentations enabled the creation of clean, 3D-printable bone models, beneficial for preoperative planning. These results demonstrate that the model achieves accurate segmentation with a high degree of overlap compared to manually segmented reference data. This accuracy results from meticulous fine-tuning of the network, along with the application of advanced data augmentation and post-processing techniques.
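The two overlap metrics reported (DSC and IoU) are straightforward to compute on binary masks; below is a minimal numpy sketch using toy 3D masks rather than MRI data.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-8):
    """IoU = |A ∩ B| / |A ∪ B| for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# Toy 3D volumes: a 2x2x2 cube vs. a partially overlapping 2x2x3 slab.
a = np.zeros((4, 4, 4), dtype=np.uint8); a[1:3, 1:3, 1:3] = 1
b = np.zeros_like(a); b[1:3, 1:3, :3] = 1
print(round(dice_coefficient(a, b), 3), round(iou(a, b), 3))  # 0.8 0.667
```

Note the identity DSC = 2·IoU/(1 + IoU), which the toy values (0.8 and 2/3) satisfy; the Average Hausdorff Distance also reported in the abstract measures boundary error instead of overlap.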
Open Access Article
No-Code Edge Artificial Intelligence Frameworks Comparison Using a Multi-Sensor Predictive Maintenance Dataset
by Juan M. Montes-Sánchez, Plácido Fernández-Cuevas, Francisco Luna-Perejón, Saturnino Vicente-Diaz and Ángel Jiménez-Fernández
Big Data Cogn. Comput. 2025, 9(6), 145; https://doi.org/10.3390/bdcc9060145 - 26 May 2025
Abstract
Edge Computing (EC) is one of the proposed solutions to the problems that industry faces when implementing Predictive Maintenance (PdM) systems, which can benefit from Edge Artificial Intelligence (Edge AI). In this work, we have compared six of the most popular no-code Edge AI frameworks in the market. The comparison considers economic cost, the number of features, usability, and performance. We used a combination of the Analytic Hierarchy Process (AHP) and the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) to compare the frameworks. We consulted ten independent experts on Edge AI, four employed in industry and the other six in academia. These experts defined the importance of each criterion by deciding the weights of TOPSIS using AHP. We performed two different classification tests on each framework platform using data from a public dataset for PdM on biomedical equipment. Magnetometer data were used for test 1, and accelerometer data were used for test 2. We obtained the F1 score, flash memory, and latency metrics. There was a high level of consensus between the worlds of academia and industry when assigning the weights. Therefore, the overall comparison ranked the analyzed frameworks similarly. NanoEdgeAIStudio ranked first when considering all weights and industry-only weights, and Edge Impulse was the first option when using academia-only weights. In terms of performance, there is room for improvement in most frameworks, as they did not reach the metrics of the previously developed custom Edge AI solution. We identified some limitations that should be fixed to improve the comparison method in the future, like adding weights to the feature criteria or increasing the number and variety of performance tests.
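The ranking step can be sketched directly: TOPSIS scores each alternative by its relative closeness to an ideal solution under criterion weights. The frameworks, criterion values, and weights below are hypothetical stand-ins for the AHP-elicited ones, not the paper's data.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS: vector-normalize the decision matrix,
    apply weights, find ideal and anti-ideal solutions per criterion, and
    score each row by relative closeness (higher = better)."""
    norm = matrix / np.linalg.norm(matrix, axis=0)
    v = norm * weights
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti, axis=1)
    return d_neg / (d_pos + d_neg)

# Hypothetical decision matrix: rows = frameworks, columns =
# [F1 score (benefit), flash kB (cost), latency ms (cost), yearly cost $ (cost)]
frameworks = ["A", "B", "C"]
matrix = np.array([[0.91, 120.0, 4.0,   0.0],
                   [0.88,  80.0, 2.5, 500.0],
                   [0.75,  60.0, 9.0,   0.0]])
weights = np.array([0.4, 0.2, 0.2, 0.2])  # e.g. elicited via AHP comparisons
benefit = np.array([True, False, False, False])

scores = topsis(matrix, weights, benefit)
ranking = [frameworks[i] for i in np.argsort(scores)[::-1]]
print(ranking)
```

In the study, the weights come from AHP pairwise comparisons aggregated over the ten consulted experts, which is why industry-only and academia-only weightings can produce different winners.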
(This article belongs to the Topic eHealth and mHealth: Challenges and Prospects, 2nd Edition)
Open Access Article
The Impact of Blockchain Technology and Dynamic Capabilities on Banks’ Performance
by Abayomi Ogunrinde, Carmen De-Pablos-Heredero, José-Luis Montes-Botella and Luis Fernández-Sanz
Big Data Cogn. Comput. 2025, 9(6), 144; https://doi.org/10.3390/bdcc9060144 - 23 May 2025
Abstract
Blockchain technology has sparked significant interest among academics and practitioners due to its potential to reduce transaction costs, improve the security of transactions, and increase transparency. However, there is still much doubt about its impact, and the technology is still in its infancy, with varying degrees of adoption among financial institutions. Structural Equation Modeling (SEM) analysis was utilized to test the impact of blockchain and dynamic capabilities on the performance of top banks in Spain. This innovative approach seeks to understand how performance can be improved by deploying blockchain technology (BC) in banks. Results showed a significant association between banks’ adoption of blockchain, the generation of dynamic capabilities, and financial performance. Thus, we can confirm that a bank adopting blockchain is more likely to create dynamic capabilities than one that does not. Hence, blockchain technology is an important tool for achieving dynamic capabilities and increasing performance in banks. Based on the findings, we suggest areas for additional research and highlight policy considerations related to the wider adoption of blockchain technology.
Open Access Article
Ship Typhoon Avoidance Route Planning Method Under Uncertain Typhoon Forecasts
by Zhengwei He, Junhong Guo, Weihao Ma and Jinfeng Zhang
Big Data Cogn. Comput. 2025, 9(6), 143; https://doi.org/10.3390/bdcc9060143 - 23 May 2025
Abstract
Formulating effective typhoon avoidance routes is crucial for ensuring the safe navigation of ocean-going vessels. From a maritime safety perspective, this paper investigates ship route optimization under typhoon forecast uncertainty. Initially, the study calculates the probability of a ship encountering a typhoon based on the distribution of historical typhoon data within the level-7 wind radius and the distance between the ship and the typhoon. Subsequently, the minimum safe distance is quantified, and a multi-objective ship route optimization model for typhoon avoidance is established. A three-dimensional multi-objective ant colony algorithm is designed to solve this model. Finally, a typhoon avoidance simulation experiment is conducted using Typhoon TAMRI and a classic route in the South China Sea as a case study. The experimental results demonstrate that under adverse conditions of uncertain typhoon forecasts, the proposed multi-objective typhoon avoidance route optimization model can effectively avoid high wind and wave areas of the typhoon while balancing and optimizing multiple navigation indicators. This model can serve as a reference for shipping companies in formulating typhoon avoidance strategies.
(This article belongs to the Special Issue Application of Artificial Intelligence in Traffic Management)
Open Access Article
Predicting the Damage of Urban Fires with Grammatical Evolution
by Constantina Kopitsa, Ioannis G. Tsoulos, Andreas Miltiadous and Vasileios Charilogis
Big Data Cogn. Comput. 2025, 9(6), 142; https://doi.org/10.3390/bdcc9060142 - 22 May 2025
Abstract
Fire, whether wild or urban, depends on the triad of oxygen, fuel, and heat. Urban fires, although smaller in scale, have devastating impacts, as evidenced by the 2018 wildfire in Mati, Attica (Greece), which claimed 104 lives. The elderly and children are the most vulnerable due to mobility and cognitive limitations. This study applies Grammatical Evolution (GE), a machine learning method that generates interpretable classification rules to predict the consequences of urban fires. Using historical data (casualties, containment time, and meteorological/demographic parameters), GE produces classification rules in human-readable form. The rules achieve over 85% accuracy, revealing critical correlations. For example, high temperatures (>35 °C) combined with irregular building layouts exponentially increase fatality risks, while firefighter response time proves more critical than fire intensity itself. Applications include dynamic evacuation strategies (real-time adaptation), preventive urban planning (fire-resistant materials and green buffer zones), and targeted awareness campaigns for at-risk groups. Unlike “black-box” machine learning techniques, GE offers transparent human-readable rules, enabling firefighters and authorities to make rapid informed decisions. Future advancements could integrate real-time data (IoT sensors and satellites) and extend the methodology to other natural disasters. Protecting urban centers from fires is not only a technological challenge but also a moral imperative to safeguard human lives and societal cohesion.
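The following is a hypothetical example of the human-readable rule form that GE evolves, built from the risk factors mentioned in the abstract; the thresholds and logic are illustrative, not rules from the paper.

```python
def fatality_risk_rule(temp_c, layout_irregular, response_min):
    """An illustrative classification rule in the transparent, human-readable
    form that Grammatical Evolution produces: high ambient temperature
    combined with an irregular building layout, or a slow firefighter
    response, flags a high-risk incident."""
    if (temp_c > 35.0 and layout_irregular) or response_min > 15.0:
        return "high"
    return "low"

print(fatality_risk_rule(38.0, True, 8.0))   # high
print(fatality_risk_rule(30.0, False, 6.0))  # low
```

Because such rules are plain conditionals rather than opaque model weights, firefighters and authorities can audit exactly why an incident was flagged.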
Open Access Article
A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian
by Ermelinda Oro, Francesco Maria Granata and Massimo Ruffolo
Big Data Cogn. Comput. 2025, 9(5), 141; https://doi.org/10.3390/bdcc9050141 - 21 May 2025
Abstract
This study presents a comprehensive evaluation of embedding techniques and large language models (LLMs) for Information Retrieval (IR) and question answering (QA) across languages, focusing on English and Italian. We address a significant research gap by providing empirical evidence of model performance across linguistic boundaries. We evaluate 12 embedding models on diverse IR datasets, including Italian SQuAD and DICE, English SciFact, ArguAna, and NFCorpus. We assess four LLMs (GPT4o, LLama-3.1 8B, Mistral-Nemo, and Gemma-2b) for QA tasks within a retrieval-augmented generation (RAG) pipeline. We evaluate them on SQuAD, CovidQA, and NarrativeQA datasets, including cross-lingual scenarios. The results show multilingual models perform more competitively than language-specific ones. The embed-multilingual-v3.0 model achieves top nDCG@10 scores of 0.90 for English and 0.86 for Italian. In QA evaluation, Mistral-Nemo demonstrates superior answer relevance (0.91–1.0) while maintaining strong groundedness (0.64–0.78). Our analysis reveals three key findings: (1) multilingual embedding models effectively bridge performance gaps between English and Italian, though performance consistency decreases in specialized domains, (2) model size does not consistently predict performance, and (3) all evaluated QA systems exhibit a critical trade-off between answer relevance and factual groundedness. Our evaluation framework combines traditional metrics with innovative LLM-based assessment techniques. It establishes new benchmarks for multilingual language technologies while providing actionable insights for real-world IR and QA system deployment.
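The retrieval metric reported (nDCG@10) can be computed in a few lines; the sketch below uses the linear-gain DCG variant (some definitions use 2^rel − 1 instead), and the relevance grades are made up for illustration.

```python
import numpy as np

def dcg_at_k(rels, k=10):
    """DCG@k with the standard log2 position discount over relevance grades."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k=10):
    """nDCG@k = DCG@k divided by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the top retrieved passages, in retrieved order:
retrieved = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(retrieved, k=10), 3))
```

scikit-learn's `ndcg_score` offers an equivalent off-the-shelf implementation for batch evaluation across queries.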
Full article
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
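The nDCG@10 scores reported in the abstract above follow the standard normalized discounted cumulative gain formula. A minimal sketch in Python, using hypothetical relevance grades rather than any of the paper's data:

```python
import math

def dcg(relevances, k=10):
    # Discounted cumulative gain over the top-k ranked results.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top results returned by a retriever:
ranking = [3, 2, 3, 0, 1, 2]
print(round(ndcg(ranking), 4))  # → 0.9608
```

nDCG@10 divides the DCG of the model's ranking by the DCG of the ideal ranking, so a perfect ordering scores exactly 1.0.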
Open Access Article
Polarity of Yelp Reviews: A BERT–LSTM Comparative Study
by
Rachid Belaroussi, Sié Cyriac Noufe, Francis Dupin and Pierre-Olivier Vandanjon
Big Data Cogn. Comput. 2025, 9(5), 140; https://doi.org/10.3390/bdcc9050140 - 21 May 2025
Abstract
With the rapid growth in social network comments, the need for more effective methods to classify their polarity—negative, neutral, or positive—has become essential. Sentiment analysis, powered by natural language processing, has evolved significantly with the adoption of advanced deep learning techniques. Long Short-Term Memory networks capture long-range dependencies in text, while transformers, with their attention mechanisms, excel at preserving contextual meaning and handling high-dimensional, semantically complex data. This study compares the performance of sentiment analysis models based on LSTM and BERT architectures using key evaluation metrics. The dataset consists of business reviews from the Yelp Open Dataset. We tested LSTM-based methods against BERT and its variants—RoBERTa, BERTweet, and DistilBERT—leveraging popular pipelines from the Hugging Face Hub. A class-by-class performance analysis is presented, revealing that more complex BERT-based models do not always guarantee superior results in the classification of Yelp reviews. Additionally, the use of bidirectionality in LSTMs does not necessarily lead to better performance. However, across a diversity of test sets, transformer models outperform traditional RNN-based models, as their generalization capability is greater than that of a simple LSTM model.
Full article
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining)
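The class-by-class analysis mentioned above rests on per-class precision, recall, and F1 over the three polarity labels. A minimal sketch with toy gold labels and predictions (not the paper's Yelp data):

```python
LABELS = ("negative", "neutral", "positive")

def per_class_f1(y_true, y_pred):
    # Precision, recall, and F1 computed separately for each polarity class.
    scores = {}
    for label in LABELS:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1)
    return scores

# Toy gold labels and model predictions:
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "negative", "positive", "positive", "neutral"]
print(per_class_f1(gold, pred)["positive"])
```

Reporting the three classes separately is what reveals the pattern the study describes: an overall accuracy gap between BERT variants and LSTMs can hide very different behavior on the minority (often neutral) class.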
Open Access Article
Tri-Collab: A Machine Learning Project to Leverage Innovation Ecosystems in Portugal
by
Ângelo Marujo, Bruno Afonso, Inês Martins, Lisandro Pires and Sílvia Fernandes
Big Data Cogn. Comput. 2025, 9(5), 139; https://doi.org/10.3390/bdcc9050139 - 20 May 2025
Abstract
This project consists of a digital platform named Tri-Collab, where investors, entrepreneurs, and other agents (mainly talents) can cooperate on their ideas and eventually co-create. It is a digital means for this triad of actors (among other potential ones) to better adjust their requirements. It includes an app that communicates with a database of projects and of innovation agents and their profiles; the originality lies in the matching algorithm. Co-creation can thus be better supported through this assertive interconnection of players and their resources. This work also highlights the usefulness of the Business Model Canvas in structuring the idea and its dashboard, allowing a comprehensive view of channels, challenges, and gains. The potential of machine learning to improve matchmaking platforms is also discussed, especially as technological advances make it possible to generate forecasts and match people at scale.
Full article
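The abstract does not specify how Tri-Collab's matching algorithm works; one common approach such a matchmaking service might take is ranking candidate profiles by cosine similarity of feature vectors. A hypothetical sketch (all names, features, and vectors invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity between two profile vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matches(seeker, candidates, top_k=2):
    # Rank candidate profiles by similarity to the seeker's requirements.
    ranked = sorted(candidates.items(), key=lambda kv: -cosine(seeker, kv[1]))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical feature vectors (sector fit, stage fit, ticket size, skills overlap):
entrepreneur = [1.0, 0.8, 0.2, 0.9]
investors = {
    "inv_a": [1.0, 0.9, 0.1, 0.8],
    "inv_b": [0.1, 0.2, 1.0, 0.1],
    "inv_c": [0.9, 0.7, 0.3, 1.0],
}
print(best_matches(entrepreneur, investors, top_k=1))
```

In a production platform the vectors would come from the database of projects and agent profiles, and a learned model could replace the hand-crafted similarity.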

Open Access Article
A Comparative Study of Ensemble Machine Learning and Explainable AI for Predicting Harmful Algal Blooms
by
Omer Mermer, Eddie Zhang and Ibrahim Demir
Big Data Cogn. Comput. 2025, 9(5), 138; https://doi.org/10.3390/bdcc9050138 - 20 May 2025
Abstract
Harmful algal blooms (HABs), driven by environmental pollution, pose significant threats to water quality, public health, and aquatic ecosystems. This study enhances the prediction of HABs in Lake Erie, part of the Great Lakes system, by utilizing ensemble machine learning (ML) models coupled with explainable artificial intelligence (XAI) for interpretability. Using water quality data from 2013 to 2020, various physical, chemical, and biological parameters were analyzed to predict chlorophyll-a (Chl-a) concentrations, a commonly used indicator of phytoplankton biomass and a proxy for algal blooms. This study employed multiple ensemble ML models, including random forest (RF), deep forest (DF), gradient boosting (GB), and XGBoost, and compared their performance against individual models, such as support vector machine (SVM), decision tree (DT), and multi-layer perceptron (MLP). The findings revealed that the ensemble models, particularly XGBoost and deep forest (DF), achieved superior predictive accuracy, with R2 values of 0.8517 and 0.8544, respectively. The application of SHapley Additive exPlanations (SHAP) provided insights into the relative importance of the input features, identifying particulate organic nitrogen (PON), particulate organic carbon (POC), and total phosphorus (TP) as the critical factors influencing Chl-a concentrations. This research demonstrates the effectiveness of ensemble ML models for achieving high predictive accuracy, while the integration of XAI enhances model interpretability. The results support the development of proactive water quality management strategies and highlight the potential of advanced ML techniques for environmental monitoring.
Full article
(This article belongs to the Special Issue Machine Learning Applications and Big Data Challenges)
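SHAP attributions like those used in the study above are additive: the per-feature values sum to the difference between the model's prediction and the baseline. Computing them for tree ensembles requires the `shap` package, but for a linear model with independent features the exact SHAP value has a closed form, which makes the additivity property easy to illustrate. A toy sketch (invented weights and observation, not the study's model):

```python
def linear_shap(weights, x, baseline):
    # For a linear model f(x) = sum(w_i * x_i) with independent features,
    # the exact SHAP value of feature i is w_i * (x_i - baseline_i).
    return [w * (xi - bi) for w, xi, bi in zip(weights, x, baseline)]

# Hypothetical standardized drivers of chlorophyll-a in a toy linear surrogate:
features = ["PON", "POC", "TP"]
weights = [0.8, 0.5, 0.3]
x = [1.2, -0.4, 2.0]          # one observation
baseline = [0.0, 0.0, 0.0]    # dataset mean in standardized units

phi = linear_shap(weights, x, baseline)
ranked = sorted(zip(features, map(abs, phi)), key=lambda t: -t[1])
print(ranked[0][0])  # feature with the largest attribution for this sample
```

Ranking features by mean absolute SHAP value across many samples is how importance plots like the study's PON/POC/TP ordering are typically produced.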
Open Access Article
The Development of Small-Scale Language Models for Low-Resource Languages, with a Focus on Kazakh and Direct Preference Optimization
by
Nurgali Kadyrbek, Zhanseit Tuimebayev, Madina Mansurova and Vítor Viegas
Big Data Cogn. Comput. 2025, 9(5), 137; https://doi.org/10.3390/bdcc9050137 - 20 May 2025
Abstract
Low-resource languages remain underserved by contemporary large language models (LLMs) because they lack sizable corpora, bespoke preprocessing tools, and the computing budgets assumed by mainstream alignment pipelines. Focusing on Kazakh, we present a 1.94B parameter LLaMA-based model that demonstrates how strong, culturally aligned performance can be achieved without massive infrastructure. The contribution is threefold. (i) Data and tokenization—we compile a rigorously cleaned, mixed-domain Kazakh corpus and design a tokenizer that respects the language’s agglutinative morphology, mixed-script usage, and diacritics. (ii) Training recipe—the model is built in two stages: causal language modeling from scratch followed by instruction tuning. Alignment is further refined with Direct Preference Optimization (DPO), extended by contrastive and entropy-based regularization to stabilize training under sparse, noisy preference signals. Two complementary resources support this step: ChatTune-DPO, a crowd-sourced set of human preference pairs, and Pseudo-DPO, an automatically generated alternative that repurposes instruction data to reduce annotation cost. (iii) Evaluation and impact—qualitative and task-specific assessments show that targeted monolingual training and the proposed DPO variant markedly improve factuality, coherence, and cultural fidelity over baseline instruction-only and multilingual counterparts. The model and datasets are released under open licenses, offering a reproducible blueprint for extending state-of-the-art language modeling to other under-represented languages and domains.
Full article
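The Direct Preference Optimization step described above minimizes a logistic loss over preference pairs, scored relative to a frozen reference model. A minimal sketch of the per-pair loss (hypothetical sequence log-probabilities; the paper's contrastive and entropy regularizers are omitted):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO pushes the policy to prefer the chosen response y_w over the
    # rejected y_l, measured relative to a frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

# Hypothetical sequence log-probabilities for one preference pair:
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0,
                ref_logp_w=-13.0, ref_logp_l=-14.0, beta=0.1)
print(round(loss, 4))  # → 0.5981
```

The loss shrinks as the policy widens the chosen-over-rejected margin beyond the reference model's, which is exactly the signal that resources such as ChatTune-DPO and Pseudo-DPO supply as preference pairs.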

Open Access Article
Helium Speech Recognition Method Based on Spectrogram with Deep Learning
by
Yonghong Chen, Shibing Zhang and Dongmei Li
Big Data Cogn. Comput. 2025, 9(5), 136; https://doi.org/10.3390/bdcc9050136 - 20 May 2025
Abstract
With the development of the marine economy and the increase in marine activities, deep saturation diving has gained significant attention. Helium speech communication is indispensable for saturation diving operations and is a critical technology for deep saturation diving, serving as the sole communication method that ensures the smooth execution of such operations. This study introduces deep learning into helium speech recognition and proposes a spectrogram-based dual-model helium speech recognition method. First, we extract spectrogram features from the helium speech. Then, we combine a deep fully convolutional neural network with connectionist temporal classification (CTC) to form an acoustic model, in which the spectrogram features of helium speech are used as input to convert speech signals into phonetic sequences. Finally, a maximum entropy hidden Markov model (MEMM) is employed as the language model to convert the phonetic sequences into word outputs; this decoding is treated as a dynamic programming problem, and the Viterbi algorithm is used to find the optimal path from phonetic sequences to word sequences. The simulation results show that the method can effectively recognize helium speech, with a recognition rate of 97.89% for isolated words and 95.99% for continuous helium speech.
Full article
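The decoding step described above searches for the most probable state sequence with the Viterbi algorithm. A generic sketch of the dynamic-programming recursion on a toy two-state model (a textbook HMM example, not the paper's MEMM or its phonetic data):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] is the probability of the most likely state sequence
    # ending in state s after observing obs[:t+1]; back[t][s] remembers
    # the predecessor that achieved it.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # Trace the optimal path backwards from the best final state.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ("rainy", "sunny")
start_p = {"rainy": 0.6, "sunny": 0.4}
trans_p = {"rainy": {"rainy": 0.7, "sunny": 0.3},
           "sunny": {"rainy": 0.4, "sunny": 0.6}}
emit_p = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p))
```

In the paper's setting the states would be words, the observations phonetic sequences, and the transition/emission scores would come from the MEMM rather than fixed tables.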

Open Access Article
Applying Big Data for Maritime Accident Risk Assessment: Insights, Predictive Insights and Challenges
by
Vicky Zampeta, Gregory Chondrokoukis and Dimosthenis Kyriazis
Big Data Cogn. Comput. 2025, 9(5), 135; https://doi.org/10.3390/bdcc9050135 - 19 May 2025
Abstract
Maritime safety is a critical concern for the transport sector and remains a key challenge for the international shipping industry. Recognizing that maritime accidents pose significant risks to both safety and operational efficiency, this study explores the application of big data analysis techniques to understand the factors influencing maritime transport accidents (MTA). Specifically, using extensive datasets derived from vessel performance measurements, environmental conditions, and accident reports, it seeks to identify the key intrinsic and extrinsic factors contributing to maritime accidents. The research examines more than 90,000 incidents for the period 2014–2022. Leveraging big data analytics and advanced statistical techniques, the findings reveal significant correlations between vessel size, speed, and specific environmental factors. Furthermore, the study highlights the potential of big data analytics in enhancing predictive modeling, real-time risk assessment, and decision-making processes for maritime traffic management. The integration of big data with intelligent transportation systems (ITSs) can optimize safety strategies, improve accident prevention mechanisms, and enhance the resilience of ocean-going transportation systems. By bridging the gap between big data applications and maritime safety research, this work contributes to the literature by emphasizing the importance of examining both intrinsic and extrinsic factors in predicting maritime accident risks. Additionally, it underscores the transformative role of big data in shaping safer and more efficient waterway transportation systems.
Full article
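Correlation findings like those above typically start from a pairwise Pearson coefficient computed over incident records. A minimal sketch with invented per-incident values (the study's actual variables and dataset are not reproduced here):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-incident records: vessel speed (knots) vs. a damage
# severity score, to illustrate the computation only.
speed = [10, 14, 9, 18, 22, 12]
severity = [2, 3, 2, 5, 6, 3]
print(round(pearson(speed, severity), 3))
```

At the scale of 90,000+ incidents the same computation would be run per factor pair (vessel size, speed, environmental conditions) to surface the significant correlations the study reports.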

Topics
Topic in Applied Sciences, BDCC, Future Internet, Information, Sci
Social Computing and Social Network Analysis
Topic Editors: Carson K. Leung, Fei Hao, Giancarlo Fortino, Xiaokang Zhou
Deadline: 30 June 2025
Topic in AI, BDCC, Fire, GeoHazards, Remote Sensing
AI for Natural Disasters Detection, Prediction and Modeling
Topic Editors: Moulay A. Akhloufi, Mozhdeh Shahbazi
Deadline: 25 July 2025
Topic in Algorithms, BDCC, BioMedInformatics, Information, Mathematics
Machine Learning Empowered Drug Screen
Topic Editors: Teng Zhou, Jiaqi Wang, Youyi Song
Deadline: 31 August 2025
Topic in IJERPH, JPM, Healthcare, BDCC, Applied Sciences, Sensors
eHealth and mHealth: Challenges and Prospects, 2nd Edition
Topic Editors: Antonis Billis, Manuel Dominguez-Morales, Anton Civit
Deadline: 31 October 2025

Special Issues
Special Issue in BDCC
Transforming Cyber Security Provision through Utilizing Artificial Intelligence
Guest Editors: Peter R. J. Trim, Yang-Im Lee
Deadline: 25 June 2025
Special Issue in BDCC
Business Intelligence and Big Data in E-commerce
Guest Editors: George Stalidis, Dimitrios Kardaras
Deadline: 30 June 2025
Special Issue in BDCC
Blockchain and Cloud Computing in Big Data and Generative AI Era
Guest Editors: Syed Muslim Jameel, Carson K. Leung
Deadline: 30 June 2025
Special Issue in BDCC
Big Data Analytics with Machine Learning for Cyber Security
Guest Editors: Babu Baniya, Sherif Abdelfattah, Deepak GC
Deadline: 30 June 2025