Data Mining and Machine Learning in the Era of Big Knowledge and Large Models

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "E1: Mathematics and Computer Science".

Deadline for manuscript submissions: 31 July 2025 | Viewed by 16093

Special Issue Editors


Guest Editor
School of Cyber Science and Engineering, Southeast University, Nanjing, China
Interests: trustworthy artificial intelligence; federated learning and graph representation learning; large models and content security

Guest Editor
Ministry of Education Key Laboratory of Knowledge Engineering with Big Data, School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Interests: knowledge graph construction; multimodal fusion; AutoGL

Special Issue Information

Dear Colleagues,

In recent years, the fields of data mining and machine learning have undergone a transformative shift with the advent of big data and the development of increasingly complex models. The integration of advanced algorithms with vast amounts of data has enabled unprecedented insights and predictions across domains including healthcare, finance, and marketing. This Special Issue aims to explore the latest advancements, challenges, and applications in data mining and machine learning within the context of big knowledge and large models.

We invite researchers to contribute original research articles, reviews, and short communications on topics related to data mining and machine learning, including but not limited to the following:

  • Scalable algorithms for big data analysis;
  • Deep learning architectures and techniques;
  • Transfer learning and domain adaptation in large-scale models;
  • Federated learning and distributed machine learning systems;
  • Explainable AI and interpretable machine learning models;
  • Privacy-preserving data mining and machine learning;
  • Secure and trustworthy machine learning;
  • Reinforcement learning in complex environments;
  • Meta-learning and automated machine learning;
  • Optimization techniques for large-scale models;
  • Applications of data mining and machine learning in real-world scenarios.

We encourage submissions that present novel methodologies, theoretical insights, experimental results, and practical applications that push the boundaries of data mining and machine learning in the era of big knowledge and large models.

Prof. Dr. Jing Zhang
Dr. Chenyang Bu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • data mining
  • deep learning
  • big data analytics
  • knowledge graph
  • large language models

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (11 papers)


Research


27 pages, 5888 KiB  
Article
Advanced Trans-EEGNet Deep Learning Model for Hypoxic-Ischemic Encephalopathy Severity Grading
by Dong-Her Shih, Feng-I Chung, Ting-Wei Wu, Shuo-Yu Huang and Ming-Hung Shih
Mathematics 2024, 12(24), 3915; https://doi.org/10.3390/math12243915 - 12 Dec 2024
Viewed by 872
Abstract
Hypoxic-ischemic encephalopathy (HIE) is a brain injury condition that poses a significant risk to newborns, potentially causing varying degrees of damage to the central nervous system. Its clinical manifestations include respiratory distress, cardiac dysfunction, hypotension, muscle weakness, seizures, and coma. As HIE represents a progressive brain injury, early identification of the extent of the damage and the implementation of appropriate treatment are crucial for reducing mortality and improving outcomes. HIE patients may face long-term complications such as cerebral palsy, epilepsy, vision loss, and developmental delays. Therefore, prompt identification and treatment of hypoxic-ischemic symptoms can help reduce the risk of severe sequelae in patients. Currently, hypothermia therapy is one of the most effective treatments for HIE patients. However, not all newborns with HIE are suitable for this therapy, making rapid and accurate assessment of the extent of brain injury critical for treatment. Among HIE patients, hypothermia therapy has shown better efficacy in those diagnosed with moderate to severe HIE within 6 h of birth, establishing this time frame as the golden period for treatment. During this golden period, an accurate assessment of HIE severity is essential for formulating appropriate treatment strategies and predicting long-term outcomes for the affected infants. This study proposes a method for addressing data imbalance and noise interference through data preprocessing techniques, including filtering and SMOTE. It then employs EEGNet, a deep learning model specifically designed for EEG classification, combined with a Transformer model featuring an attention mechanism that excels at capturing long-term sequential features to construct the Trans-EEGNet model. This model outperforms previous methods in computation time and feature extraction, enabling rapid classification and assessment of HIE severity in newborns. Full article
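The SMOTE oversampling step the abstract mentions for addressing class imbalance can be illustrated with a minimal sketch (the function and toy data below are ours, not the authors' code; in the paper this would be applied to EEG feature vectors):

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each sample and one of its k nearest neighbours (the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # indices of the k nearest neighbours of each sample
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                 # pick a minority sample
        j = nn[i, rng.integers(k)]          # pick one of its neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# toy minority class: 5 points in 2-D feature space
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_oversample(X_min, n_new=10, k=2, rng=0)
print(X_new.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled set stays inside the minority class's convex hull.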

34 pages, 5016 KiB  
Article
Advanced Trans-BiGRU-QA Fusion Model for Atmospheric Mercury Prediction
by Dong-Her Shih, Feng-I. Chung, Ting-Wei Wu, Bo-Hao Wang and Ming-Hung Shih
Mathematics 2024, 12(22), 3547; https://doi.org/10.3390/math12223547 - 13 Nov 2024
Viewed by 816
Abstract
With the deepening of the Industrial Revolution and the rapid development of the chemical industry, the large-scale emissions of corrosive dust and gases from numerous factories have become a significant source of air pollution. Mercury in the atmosphere, identified by the United Nations Environment Programme (UNEP) as one of the globally concerning air pollutants, has been proven to pose a threat to the human environment with potential carcinogenic risks. Therefore, accurately predicting atmospheric mercury concentration is of critical importance. This study proposes a novel advanced model, the Trans-BiGRU-QA hybrid, designed to predict atmospheric mercury concentration accurately. The methodology applies feature engineering techniques to extract relevant features and a sliding window technique for time series preprocessing. Furthermore, the proposed Trans-BiGRU-QA model is compared to other deep learning models, such as GRU, LSTM, RNN, Transformer, BiGRU, and Trans-BiGRU. This study utilizes air quality data from Vietnam to train and test the models, evaluating their performance in predicting atmospheric mercury concentration. The results show that the Trans-BiGRU-QA model performed exceptionally well in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R-squared (R2), demonstrating high accuracy and robustness. Compared to other deep learning models, the Trans-BiGRU-QA model exhibited significant advantages, indicating its broad potential for application in environmental pollution prediction. Full article
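The sliding window preprocessing the abstract mentions can be sketched as follows (an illustrative stand-in assuming a univariate series and a one-step-ahead target; the paper's features and horizon may differ):

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Split a 1-D series into (X, y) pairs: each X row holds `window`
    consecutive values, y is the value `horizon` steps after the window."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)

series = np.arange(10.0)       # stand-in for an hourly mercury series
X, y = sliding_windows(series, window=3)
print(X.shape, y.shape)        # (7, 3) (7,)
print(X[0], y[0])              # [0. 1. 2.] 3.0
```

Each row of X is then fed to the recurrent model as one training sequence, with y as its regression target.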

21 pages, 2469 KiB  
Article
Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors
by Raz Lapid, Almog Dubin and Moshe Sipper
Mathematics 2024, 12(22), 3451; https://doi.org/10.3390/math12223451 - 5 Nov 2024
Cited by 1 | Viewed by 1182
Abstract
Adaptive adversarial attacks, where adversaries tailor their strategies with full knowledge of defense mechanisms, pose significant challenges to the robustness of adversarial detectors. In this paper, we introduce RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to fortify adversarial detectors against such adaptive attacks while preserving the classifier’s accuracy. RADAR employs adversarial training by incorporating adversarial examples—crafted to deceive both the classifier and the detector—into the training process. This dual optimization enables the detector to learn and adapt to sophisticated attack scenarios. Comprehensive experiments on CIFAR-10, SVHN, and ImageNet datasets demonstrate that RADAR substantially enhances the detector’s ability to accurately identify adaptive adversarial attacks without degrading classifier performance. Full article
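Adversarial examples of the kind RADAR folds into training are often generated by gradient-sign perturbations; a minimal FGSM-style sketch on a toy linear classifier (our illustration, not the paper's attack) looks like this:

```python
import numpy as np

def fgsm(x, grad, eps):
    """Fast Gradient Sign Method: take one eps-sized step in the sign
    direction of the loss gradient to increase the loss."""
    return x + eps * np.sign(grad)

# toy linear classifier: score = w @ x, predict class 1 if score > 0
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.4, -0.1, 0.2])   # score = 0.7, classified as class 1

# for true class 1 the loss is -score, so its gradient w.r.t. x is -w
x_adv = fgsm(x, -w, eps=0.5)
print(w @ x, w @ x_adv)          # 0.7 -1.05: the decision flips
```

Retraining the detector on such perturbed inputs (crafted against both classifier and detector, as the paper does) is what hardens it against adaptive attacks.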

17 pages, 791 KiB  
Article
Pre-Trained Language Model Ensemble for Arabic Fake News Detection
by Lama Al-Zahrani and Maha Al-Yahya
Mathematics 2024, 12(18), 2941; https://doi.org/10.3390/math12182941 - 21 Sep 2024
Cited by 1 | Viewed by 1607
Abstract
Fake news detection (FND) remains a challenge due to its vast and varied sources, especially on social media platforms. While numerous attempts have been made by academia and industry to develop fake news detection systems, research on Arabic content remains limited. This study investigates transformer-based language models for Arabic FND. While transformer-based models have shown promising performance in various natural language processing tasks, they often struggle with tasks involving complex linguistic patterns and cultural contexts, resulting in unreliable performance and misclassification problems. To overcome these challenges, we investigated an ensemble of transformer-based models. We experimented with five Arabic transformer models: AraBERT, MARBERT, AraELECTRA, AraGPT2, and ARBERT. Various ensemble approaches, including a weighted-average ensemble, hard voting, and soft voting, were evaluated to determine the most effective techniques for boosting learning models and improving prediction accuracies. The results demonstrate that ensemble models significantly boost the baseline model performance. An important finding is that ensemble models achieved excellent performance on the Arabic Multisource Fake News Detection (AMFND) dataset, reaching an F1 score of 94% using weighted averages. Moreover, changing the number of models in the ensemble has only a slight effect on performance. These key findings contribute to the advancement of fake news detection in Arabic, offering valuable insights for both academia and industry. Full article
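The three ensemble schemes the study compares can be sketched on toy class probabilities (the numbers and weights below are illustrative, not outputs of the paper's models):

```python
import numpy as np

# Stand-in probabilities from three classifiers for 4 articles,
# classes = [real, fake]; in the paper these would come from the
# fine-tuned Arabic transformer models.
probs = np.array([
    [[0.90, 0.10], [0.20, 0.80], [0.99, 0.01], [0.40, 0.60]],  # model A
    [[0.80, 0.20], [0.30, 0.70], [0.40, 0.60], [0.50, 0.50]],  # model B
    [[0.70, 0.30], [0.10, 0.90], [0.45, 0.55], [0.30, 0.70]],  # model C
])

# Soft voting: average the class probabilities, then take the argmax.
soft = probs.mean(axis=0).argmax(axis=1)

# Hard voting: each model casts its argmax vote; the majority wins.
votes = probs.argmax(axis=2)
hard = np.array([np.bincount(v, minlength=2).argmax() for v in votes.T])

# Weighted average: trust better models more (weights are illustrative).
w = np.array([0.5, 0.3, 0.2])
weighted = np.tensordot(w, probs, axes=1).argmax(axis=1)

print(soft, hard, weighted)  # [0 1 0 1] [0 1 1 1] [0 1 0 1]
```

Note how the one very confident model sways the soft and weighted votes on the third article while hard voting follows the two-model majority; differences of exactly this kind are what the study's comparison measures.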

14 pages, 505 KiB  
Article
Few-Shot Learning Sensitive Recognition Method Based on Prototypical Network
by Guoquan Yuan, Xinjian Zhao, Liu Li, Song Zhang and Shanming Wei
Mathematics 2024, 12(17), 2791; https://doi.org/10.3390/math12172791 - 9 Sep 2024
Viewed by 1059
Abstract
Traditional machine learning-based entity extraction methods rely heavily on feature engineering by experts, and their generalization ability is poor. Prototype networks, by contrast, can effectively train models with a small amount of labeled data, using category prototypes to enhance generalization. This paper therefore proposes a prototype network-based named entity recognition (NER) method, the FSPN-NER model, to address the difficulty of recognizing sensitive data in data-sparse text. The model utilizes a positional coding model (PCM) to pre-train the data and perform feature extraction, computes prototype vectors to achieve entity matching, and introduces a boundary detection module to enhance the performance of the prototype network in the NER task. The model is compared with LSTM, BiLSTM, CRF, Transformer, and their combination models; experimental results on the test dataset show that it outperforms the comparative models with an accuracy of 84.8%, a recall of 85.8%, and an F1 score of 85.3%. Full article
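The nearest-prototype matching at the core of a prototypical network can be sketched in a few lines (toy 2-D embeddings stand in for the model's learned features; names and data are ours):

```python
import numpy as np

def prototype_classify(support, support_labels, query, n_classes):
    """Few-shot classification in embedding space: each class prototype is
    the mean of its labeled support embeddings; queries are assigned to
    the nearest prototype by Euclidean distance."""
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(n_classes)])
    d = np.linalg.norm(query[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)

# toy 2-D "embeddings": two entity classes clustered apart
support = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
query = np.array([[0.05, 0.05], [0.95, 0.95]])
print(prototype_classify(support, labels, query, 2))  # [0 1]
```

Because only the class means need to be estimated, a handful of labeled examples per category suffices, which is what makes the approach attractive for data-sparse sensitive-entity recognition.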

19 pages, 2828 KiB  
Article
KCB-FLAT: Enhancing Chinese Named Entity Recognition with Syntactic Information and Boundary Smoothing Techniques
by Zhenrong Deng, Zheng Huang, Shiwei Wei and Jinglin Zhang
Mathematics 2024, 12(17), 2714; https://doi.org/10.3390/math12172714 - 30 Aug 2024
Cited by 1 | Viewed by 923
Abstract
Named entity recognition (NER) is a fundamental task in Natural Language Processing (NLP). During training, NER models suffer from over-confidence; the Chinese NER task in particular involves word segmentation, which introduces erroneous entity boundary segmentation, exacerbating over-confidence and reducing the model's overall performance. These issues limit further enhancement of NER models. To tackle these problems, we propose a new model named KCB-FLAT, designed to enhance Chinese NER performance by integrating enriched semantic information with a word-boundary smoothing technique. Specifically, we first extract several types of syntactic information and encode it with a Key-Value Memory Network, integrating the result through an attention mechanism to generate syntactic feature embeddings for Chinese characters. We then employ a Cross-Transformer encoder to thoroughly combine syntactic and lexical information and address the entity boundary segmentation errors caused by lexical information. Finally, we introduce a Boundary Smoothing module, combined with a regularity-conscious function, to capture the internal regularity of each entity, reducing the model's over-confidence in entity probabilities through smoothing. Experimental results demonstrate that the proposed model achieves exceptional performance on the MSRA, Resume, Weibo, and self-built ZJ datasets, as verified by the F1 score. Full article
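The boundary smoothing idea, reassigning a small amount of probability mass from each annotated boundary to its neighbouring positions so the model is not trained toward hard 0/1 targets, can be sketched in one dimension (a simplified illustration, not the paper's exact formulation):

```python
import numpy as np

def smooth_boundary(one_hot, eps=0.2):
    """Spread eps of each boundary label's mass onto the adjacent
    positions, keeping 1 - eps at the annotated boundary itself."""
    n = len(one_hot)
    smoothed = one_hot * (1 - eps)
    for i, v in enumerate(one_hot):
        if v:  # annotated boundary position
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
            for j in neighbours:
                smoothed[j] += eps * v / len(neighbours)
    return smoothed

# token positions 0..5, annotated entity boundary at position 2
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
print(smooth_boundary(target))  # mass 0.1 at 1, 0.8 at 2, 0.1 at 3
```

Training against the softened target penalizes the model less for near-miss boundaries, which is how smoothing counteracts the over-confidence the abstract describes.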

16 pages, 2449 KiB  
Article
Enhancing Knowledge-Concept Recommendations with Heterogeneous Graph-Contrastive Learning
by Liting Wei, Yun Li, Weiwei Wang and Yi Zhu
Mathematics 2024, 12(15), 2324; https://doi.org/10.3390/math12152324 - 25 Jul 2024
Viewed by 988
Abstract
With the implementation of conceptual labeling on online learning resources, knowledge-concept recommendations have been introduced to pinpoint concepts that learners may wish to delve into more deeply. As the core subject of learning, learners’ preferences in knowledge concepts should be given greater attention. Research indicates that learners’ preferences for knowledge concepts are influenced by the characteristics of their group structure. There is a high degree of homogeneity within a group, and notable distinctions exist between the internal and external configurations of a group. To strengthen the group-structure characteristics of learners’ behaviors, a multi-task strategy for knowledge-concept recommendations is proposed; this strategy is called Knowledge-Concept Recommendations with Heterogeneous Graph-Contrastive Learning. Specifically, due to the difficulty of accessing authentic social networks, learners and their structural neighbors are considered positive contrastive pairs to construct self-supervision signals on the predefined meta-path from heterogeneous information networks as auxiliary tasks, which capture the higher-order neighbors of learners by presenting different perspectives. Then, the Information Noise-Contrastive Estimation loss is regarded as the main training objective to increase the differentiation of learners from different professional backgrounds. Extensive experiments are constructed on MOOCCube, and we find that our proposed method outperforms the other state-of-the-art concept-recommendation methods, achieving 6.66% with HR@5, 8.85% with NDCG@5, and 8.68% with MRR. Full article
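The InfoNCE objective used as the main training signal can be sketched for a single anchor (toy 2-D embeddings; in the paper the anchor and positive would be a learner and a structural neighbour drawn from the heterogeneous graph):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: pull the positive close and push the
    negatives away, using cosine similarity scaled by temperature tau."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])                      # positive sits at index 0

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # structural neighbour: similar embedding
negative = np.array([0.0, 1.0])   # learner from a different background
loss_good = info_nce(anchor, positive, [negative])
loss_bad  = info_nce(anchor, negative, [positive])
print(loss_good < loss_bad)  # True
```

Minimizing this loss is what increases the separation between learners from different professional backgrounds while keeping structural neighbours close.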

15 pages, 693 KiB  
Article
DABC: A Named Entity Recognition Method Incorporating Attention Mechanisms
by Fangling Leng, Fan Li, Yubin Bao, Tiancheng Zhang and Ge Yu
Mathematics 2024, 12(13), 1992; https://doi.org/10.3390/math12131992 - 27 Jun 2024
Cited by 1 | Viewed by 1124
Abstract
Existing models for extracting features of complex, similar entities make limited use of relative position information and struggle to extract key features. Chinese named entity recognition is distinct from English in its absence of space delimiters, significant polysemy and homonymy of characters, diverse and common names, and greater reliance on complex contextual and linguistic structures. We propose an entity recognition method based on DeBERTa-Attention-BiLSTM-CRF (DABC). First, the feature extraction capability of the DeBERTa model is utilized to extract data features; then, an attention mechanism is introduced to further enhance the extracted features; finally, BiLSTM captures long-distance dependencies in the text, the CRF layer produces the predicted sequences, and the entities in the text are identified. The proposed model is applied to the dataset for validation. Experiments show that the precision (P) of the proposed DABC model on the dataset reaches 88.167%, the recall (R) reaches 83.121%, and the F1 value reaches 85.024%. Compared with other models, the F1 value improves by 3∼5%, verifying the superiority of the model. In the future, it can be extended and applied to recognize complex entities in more fields. Full article
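The CRF layer's decoding step, recovering the best tag sequence from per-token emission scores and tag-transition scores, is classically done with the Viterbi algorithm; a minimal sketch (toy scores, two tags):

```python
import numpy as np

def viterbi(emissions, transitions):
    """CRF decoding: return the highest-scoring tag sequence given
    emission scores (T x K) and tag-transition scores (K x K)."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag i at t-1, moving to tag j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):   # trace back the best predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# 3 tokens, 2 tags (0 = O, 1 = ENTITY); transitions discourage 1 -> 1
emissions = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 0.2]])
transitions = np.array([[0.5, 0.0], [0.0, -1.0]])
print(viterbi(emissions, transitions))  # [0, 1, 0]
```

The transition matrix is what lets the CRF layer veto locally plausible but globally inconsistent tag sequences from the BiLSTM outputs.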

18 pages, 1039 KiB  
Article
A Novel Online Hydrological Data Quality Control Approach Based on Adaptive Differential Evolution
by Qun Zhao, Shicheng Cui, Yuelong Zhu, Rui Li and Xudong Zhou
Mathematics 2024, 12(12), 1821; https://doi.org/10.3390/math12121821 - 12 Jun 2024
Cited by 1 | Viewed by 702
Abstract
The quality of hydrological data has a significant impact on hydrological models, where stable and anomaly-free hydrological time series typically yield more valuable patterns. In this paper, we conduct data analysis and propose an online hydrological data quality control method based on an adaptive differential evolution algorithm according to the characteristics of hydrological data. Taking into account the characteristics of continuity, periodicity, and seasonality, we develop a Periodic Temporal Long Short-Term Memory (PT-LSTM) predictive control model. Building upon the real-time nature of the data, we apply the Adaptive Differential Evolution algorithm to optimize PT-LSTM, creating an Online Composite Predictive Control Model (OCPT-LSTM) that provides confidence intervals and recommended values for control and replacement. The experimental results demonstrate that the proposed data quality control method effectively manages data quality; detects data anomalies; provides suggested values; reduces reliance on manual intervention; provides a solid data foundation for hydrological data analysis work; and helps hydrological personnel in water resource scheduling, flood control, and other related tasks. Meanwhile, the proposed method can also be applied to the analysis of time series data in other industries. Full article
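The differential evolution driving the model optimization can be illustrated with a plain DE/rand/1/bin loop on a toy objective (fixed F and CR here; the paper's adaptive variant adjusts these parameters online, and the real objective would be the PT-LSTM validation error):

```python
import numpy as np

def differential_evolution(f, bounds, pop=20, gens=100, F=0.8, CR=0.9, seed=0):
    """Minimal DE/rand/1/bin: mutate with scaled difference vectors,
    binomially cross over, and keep the better of parent and trial."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    X = rng.uniform(lo, hi, size=(pop, len(bounds)))
    fit = np.array([f(x) for x in X])
    for _ in range(gens):
        for i in range(pop):
            others = [j for j in range(pop) if j != i]
            a, b, c = X[rng.choice(others, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(len(bounds)) < CR
            cross[rng.integers(len(bounds))] = True  # at least one gene
            trial = np.where(cross, mutant, X[i])
            ft = f(trial)
            if ft < fit[i]:                          # greedy selection
                X[i], fit[i] = trial, ft
    return X[fit.argmin()], fit.min()

# toy objective standing in for PT-LSTM validation error, optimum at 0.3
best_x, best_f = differential_evolution(lambda x: ((x - 0.3) ** 2).sum(),
                                        bounds=[(-1, 1), (-1, 1)])
print(best_x, best_f)
```

Each candidate vector here would encode a hyper-parameter setting of the predictive model, so the loop searches configurations without gradient information.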

14 pages, 2003 KiB  
Article
Abnormal Traffic Detection System Based on Feature Fusion and Sparse Transformer
by Xinjian Zhao, Weiwei Miao, Guoquan Yuan, Yu Jiang, Song Zhang and Qianmu Li
Mathematics 2024, 12(11), 1643; https://doi.org/10.3390/math12111643 - 24 May 2024
Viewed by 1389
Abstract
This paper presents a feature fusion and sparse transformer-based anomalous traffic detection system (FSTDS). FSTDS uses a feature fusion network to encode traffic data sequences and extract features, fusing them into coding vectors through shallow and deep convolutional networks; a sparse transformer then performs deep encoding to capture the complex relationships between network flows; finally, a multilayer perceptron classifies the traffic to achieve anomaly detection. The feature fusion network of FSTDS improves feature extraction from small sample data, the deep encoder enhances the understanding of complex traffic patterns, and the sparse transformer reduces the computational and storage overhead and improves the scalability of the model. Experiments demonstrate that the number of FSTDS parameters is reduced by up to nearly half compared to the baseline, and the success rate of anomalous flow detection is close to 100%. Full article
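One common way a sparse transformer cuts cost is to let each query attend only to its top-k keys; a minimal sketch of that idea (our illustration; the paper's sparsification scheme may differ):

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=2):
    """Sparse attention: each query keeps only its k largest key scores,
    masking the rest to -inf before the softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    # per-row threshold: the k-th largest score
    kth = np.sort(scores, axis=1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out, w = topk_sparse_attention(Q, K, V, k=2)
print(out.shape, (w > 0).sum(axis=1))  # each query attends to 2 keys
```

Only k attention weights per query survive, which is the source of the memory and compute savings the abstract credits to the sparse transformer.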

Review


33 pages, 626 KiB  
Review
Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review
by Wan Zhang and Jing Zhang
Mathematics 2025, 13(5), 856; https://doi.org/10.3390/math13050856 - 4 Mar 2025
Viewed by 3731
Abstract
Retrieval-augmented generation (RAG) leverages the strengths of information retrieval and generative models to enhance the handling of real-time and domain-specific knowledge. Despite its advantages, limitations within RAG components may cause hallucinations (more precisely termed confabulations) in generated outputs, driving extensive research to address these limitations and mitigate hallucinations. This review focuses on hallucination in retrieval-augmented large language models (LLMs). We first examine the causes of hallucinations in the different sub-tasks of the retrieval and generation phases. Then, we provide a comprehensive overview of the corresponding hallucination mitigation techniques, offering a targeted and complete framework for addressing hallucinations in retrieval-augmented LLMs. We also investigate methods to reduce the impact of hallucination through detection and correction. Finally, we discuss promising future research directions for mitigating hallucinations in retrieval-augmented LLMs. Full article
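A simple baseline in the detection-and-correction family the review surveys is a lexical grounding check: flag generated sentences that share too little vocabulary with the retrieved passages. The sketch below is a naive illustrative heuristic of ours, not a method from the review:

```python
def unsupported_sentences(answer_sentences, passages, threshold=0.5):
    """Flag answer sentences whose content-word overlap with the
    retrieved passages falls below `threshold` (crude hallucination
    detection; real systems use entailment models or fact checkers)."""
    passage_words = set(w.lower() for p in passages for w in p.split())
    flagged = []
    for s in answer_sentences:
        words = [w.lower().strip(".,") for w in s.split()]
        overlap = sum(w in passage_words for w in words) / max(len(words), 1)
        if overlap < threshold:
            flagged.append(s)
    return flagged

passages = ["RAG retrieves documents before generation",
            "retrieval improves factual grounding"]
answer = ["RAG retrieves documents before generation",
          "it was invented in 1962 by aliens"]
print(unsupported_sentences(answer, passages))
```

Sentences the check flags would then be routed to a correction step, such as re-retrieval or regeneration with stricter grounding.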
