Machine Learning in Data Mining for Knowledge Discovery

A special issue of Big Data and Cognitive Computing (ISSN 2504-2289).

Deadline for manuscript submissions: closed (30 June 2024) | Viewed by 15519

Special Issue Editors

Department of Computer Science, University of Regina, Regina, SK S4S 0A2, Canada
Interests: machine learning; data mining; rough sets; Lie group machine learning; three-way decisions

Guest Editor
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100081, China
Interests: Big Data; multi-task learning; cognitive service

Special Issue Information

Dear Colleagues,

It is our pleasure to announce a new Special Issue of the journal Big Data and Cognitive Computing titled “Machine Learning in Data Mining for Knowledge Discovery”.

Data mining was introduced in the 1990s as a strategy for obtaining knowledge from data. Several steps lead from uncleaned, unstructured data to useful knowledge (e.g., patterns, rules, and other entities). Machine learning has been applied to each of these steps, but in the Big Data era, the explosion in data volume, the distribution of data, data sparsity, and partially missing or invalid data mean that generic data mining technologies may need to be adapted and that new approaches are desired. One challenge is that most machine learning approaches (specifically non-symbolic approaches) can perform well, yet their results might not be understandable by humans and thus might be difficult to apply in practice. Symbolic methods, on the other hand, are suitable when people are more interested in knowledge in a form that can be easily understood and from which practical actions can be inferred manually. Another area that has not been given enough attention is how to utilize the mined patterns and rules and how to evaluate their benefits and costs in practice. Furthermore, studies that combine symbolic and non-symbolic approaches might be a promising direction for improving both human understanding and usability.

In this Special Issue, original research articles and reviews are welcome. Research areas may include (but are not limited to) the following:

  1. Association rules mining in Big Data;
  2. Classification rules mining in Big Data;
  3. Unstructured data cleansing;
  4. Distributed data mining;
  5. Adaptation of deep learning approaches to data mining;
  6. Actionable rules mining in Big Data;
  7. Neural networks for knowledge discovery;
  8. Generic machine learning model studies in data mining;
  9. Knowledge evaluation and quantization;
  10. Utility analysis for rules and patterns;
  11. Cost–benefit analysis for rules and patterns;
  12. Redundant data removal and its quality evaluation;
  13. Attribute reduction;
  14. Social network discovery;
  15. Data integration and data fusion from a variety of sources.

Dr. Cong Gao
Dr. Chuntao Ding
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Big Data and Cognitive Computing is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining
  • Big Data
  • machine learning
  • association rules
  • actionable rules
  • expected pattern
  • knowledge discovery
  • attribute reduction
  • pattern evaluation

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)

Research

19 pages, 7056 KiB  
Article
A Data-Centric Approach to Understanding the 2020 U.S. Presidential Election
by Satish Mahadevan Srinivasan and Yok-Fong Paat
Big Data Cogn. Comput. 2024, 8(9), 111; https://doi.org/10.3390/bdcc8090111 - 4 Sep 2024
Viewed by 784
Abstract
The application of analytics on Twitter feeds is a very popular field for research. A tweet with a 280-character limitation can reveal a wealth of information on how individuals express their sentiments and emotions within their network or community. Upon collecting, cleaning, and mining tweets from different individuals on a particular topic, we can capture not only the sentiments and emotions of an individual but also the sentiments and emotions expressed by a larger group. Using the well-known Lexicon-based NRC classifier, we classified nearly seven million tweets across seven battleground states in the U.S. to understand the emotions and sentiments expressed by U.S. citizens toward the 2020 presidential candidates. We used the emotions and sentiments expressed within these tweets as proxies for their votes and predicted the swing directions of each battleground state. When compared to the outcome of the 2020 presidential candidates, we were able to accurately predict the swing directions of four battleground states (Arizona, Michigan, Texas, and North Carolina), thus revealing the potential of this approach in predicting future election outcomes. The week-by-week analysis of the tweets using the NRC classifier corroborated well with the various political events that took place before the election, making it possible to understand the dynamics of the emotions and sentiments of the supporters in each camp. These research strategies and evidence-based insights may be translated into real-world settings and practical interventions to improve election outcomes. Full article
(This article belongs to the Special Issue Machine Learning in Data Mining for Knowledge Discovery)

15 pages, 680 KiB  
Article
Automating Feature Extraction from Entity-Relation Models: Experimental Evaluation of Machine Learning Methods for Relational Learning
by Boris Stanoev, Goran Mitrov, Andrea Kulakov, Georgina Mirceva, Petre Lameski and Eftim Zdravevski
Big Data Cogn. Comput. 2024, 8(4), 39; https://doi.org/10.3390/bdcc8040039 - 1 Apr 2024
Cited by 1 | Viewed by 1891
Abstract
With the exponential growth of data, extracting actionable insights becomes resource-intensive. In many organizations, normalized relational databases store a significant portion of this data, where tables are interconnected through some relations. This paper explores relational learning, which involves joining and merging database tables, often normalized in the third normal form. The subsequent processing includes extracting features and utilizing them in machine learning (ML) models. In this paper, we experiment with the propositionalization algorithm (i.e., Wordification) for feature engineering. Next, we compare the algorithms PropDRM and PropStar, which are designed explicitly for multi-relational data mining, to traditional machine learning algorithms. Based on the performed experiments, we concluded that Gradient Boost, compared to PropDRM, achieves similar performance (F1 score, accuracy, and AUC) on multiple datasets. PropStar consistently underperformed on some datasets while being comparable to the other algorithms on others. In summary, the propositionalization algorithm for feature extraction makes it feasible to apply traditional ML algorithms for relational learning directly. In contrast, approaches tailored specifically for relational learning still face challenges in scalability, interpretability, and efficiency. These findings have a practical impact that can help speed up the adoption of machine learning in business contexts where data is stored in relational format without requiring domain-specific feature extraction. Full article

21 pages, 2309 KiB  
Communication
Sentiment Analysis and Text Analysis of the Public Discourse on Twitter about COVID-19 and MPox
by Nirmalya Thakur
Big Data Cogn. Comput. 2023, 7(2), 116; https://doi.org/10.3390/bdcc7020116 - 9 Jun 2023
Cited by 22 | Viewed by 4334
Abstract
Mining and analysis of the big data of Twitter conversations have been of significant interest to the scientific community in the fields of healthcare, epidemiology, big data, data science, computer science, and their related areas, as can be seen from several works in the last few years that focused on sentiment analysis and other forms of text analysis of tweets related to Ebola, E-Coli, Dengue, Human Papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson’s, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have served as “catalysts” for Twitter usage related to seeking and sharing information, views, opinions, and sentiments involving both of these viruses. None of the prior works in this field analyzed tweets focusing on both COVID-19 and MPox simultaneously. To address this research gap, a total of 61,862 tweets that focused on MPox and COVID-19 simultaneously, posted between 7 May 2022 and 3 March 2023, were studied. The findings and contributions of this study are manifold. First, the results of sentiment analysis using the VADER (Valence Aware Dictionary for sEntiment Reasoning) approach show that nearly half the tweets (46.88%) had a negative sentiment. It was followed by tweets that had a positive sentiment (31.97%) and tweets that had a neutral sentiment (21.14%), respectively. Second, this paper presents the top 50 hashtags used in these tweets. Third, it presents the top 100 most frequently used words in these tweets after performing tokenization, removal of stopwords, and word frequency analysis. The findings indicate that tweets in this context included a high level of interest regarding COVID-19, MPox and other viruses, President Biden, and Ukraine.
Finally, a comprehensive comparative study that compares the contributions of this paper with 49 prior works in this field is presented to further uphold the relevance and novelty of this work. Full article
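The text-analysis steps named in this abstract (tokenization, stopword removal, and frequency counts for words and hashtags) can be illustrated with a minimal sketch. It is not the author's pipeline: the stopword set, the regular expression, the function names, and the three sample strings are all invented for illustration, and no sentiment scoring is performed.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "and", "of", "is", "are", "to", "in", "about"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping a leading '#'."""
    return re.findall(r"#?\w+", text.lower())


def top_terms(tweets, n):
    """Return the n most frequent non-stopword words and the n most frequent hashtags."""
    words = Counter()
    hashtags = Counter()
    for tweet in tweets:
        for token in tokenize(tweet):
            if token.startswith("#"):
                hashtags[token] += 1
            elif token not in STOPWORDS:
                words[token] += 1
    return words.most_common(n), hashtags.most_common(n)


# Invented sample strings, used only to exercise the functions.
sample_tweets = [
    "The mpox outbreak is in the news",
    "News about mpox and covid",
    "#mpox #covid updates",
]
top_words, top_hashtags = top_terms(sample_tweets, 2)
print(top_words)
print(top_hashtags)
```

Hashtags are counted separately from plain words here so that the two rankings mentioned in the abstract (top hashtags and top words) come out as distinct lists.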

19 pages, 8780 KiB  
Article
Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study
by Menna Ibrahim Gabr, Yehia Mostafa Helmy and Doaa Saad Elzanfaly
Big Data Cogn. Comput. 2023, 7(1), 55; https://doi.org/10.3390/bdcc7010055 - 22 Mar 2023
Cited by 5 | Viewed by 3026
Abstract
Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely, accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not consider unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely, the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The sensitivity of the classifiers when data are missing due to the Missing Completely At Random (MCAR) pattern is less than their sensitivity when data are missing due to the Missing Not At Random (MNAR) pattern. Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to the missing data. Full article
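The point about accuracy being misleading on unbalanced datasets can be made concrete with a small worked example. The sketch below is not taken from the paper: the 95/5 class split and both sets of predictions are invented, and the helper names are hypothetical. It computes accuracy, the F1-score, and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Classifier A always predicts the majority class.
majority = confusion_counts(y_true, [0] * 100)

# Classifier B makes 2 false alarms and finds 3 of the 5 positives.
informed = confusion_counts(y_true, [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2)

for name, counts in [("majority", majority), ("informed", informed)]:
    print(name, accuracy(*counts), f1_score(*counts), round(mcc(*counts), 3))
```

In this toy setup the majority-class predictor reaches 0.95 accuracy while its F1-score and MCC are both 0, whereas the second predictor has similar accuracy (0.96) but a clearly positive MCC, which is the kind of gap the abstract attributes to relying on accuracy alone.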

23 pages, 3933 KiB  
Article
Deep Clustering-Based Anomaly Detection and Health Monitoring for Satellite Telemetry
by Muhamed Abdulhadi Obied, Fayed F. M. Ghaleb, Aboul Ella Hassanien, Ahmed M. H. Abdelfattah and Wael Zakaria
Big Data Cogn. Comput. 2023, 7(1), 39; https://doi.org/10.3390/bdcc7010039 - 22 Feb 2023
Cited by 7 | Viewed by 3866
Abstract
Satellite telemetry data plays an ever-important role in both the safety and the reliability of a satellite. These two factors are extremely significant in the field of space systems and space missions. Since it is challenging to repair space systems in orbit, health monitoring and early anomaly detection approaches are crucial for the success of space missions. A large number of efficient and accurate methods for health monitoring and anomaly detection have been proposed in aerospace systems but without showing enough concern for the patterns that can be mined from normal operational telemetry data. Concerning this, the present paper proposes DCLOP, an intelligent Deep Clustering-based Local Outlier Probabilities approach that aims at detecting anomalies alongside extracting realistic and reasonable patterns from the normal operational telemetry data. The proposed approach combines (i) a new deep clustering method that uses a dynamically weighted loss function with (ii) the adapted version of Local Outlier Probabilities based on the results of deep clustering. The DCLOP approach effectively monitors the health status of a spacecraft and detects the early warnings of its on-orbit failures. Therefore, this approach enhances the validity and accuracy of anomaly detection systems. The performance of the suggested approach is assessed using actual cube satellite telemetry data. The experimental findings prove that the suggested approach is competitive to the currently used techniques in terms of effectiveness, viability, and validity. Full article