Data Mining & Machine Learning Techniques for the Analysis of Stream Data

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Processes".

Deadline for manuscript submissions: closed (21 December 2023) | Viewed by 18004

Special Issue Editor

Dr. Kichun Lee
Department of Industrial Engineering, Hanyang University, Seoul 04763, Republic of Korea
Interests: social data mining; bio data mining; bioinformatics with statistical learning; time series; computational and wavelet methods

Special Issue Information

Dear Colleagues,

Stream data can now be collected and stored in near real time in a variety of domains, such as sensor networks, e-commerce, financial systems, and social networks, to name a few. The rise and abundance of stream data have increased the need for online learning or incremental learning in data mining and machine learning, because traditional methods struggle to cope with the large amount of real-time data. Numerous data mining and machine learning methods have found their way into the tasks of regression, classification, recommendation, and outlier detection.

While such methods have been beneficial for offline data, usually through batch learning, applying them to stream data has exposed a number of shortcomings, such as memory restrictions, the speed of model updates, the inability to guarantee the best model, the choice of window size, and so forth. In addition, high-speed stream data require traditional performance measures, such as accuracy and F1 score, to be adapted to the online setting. Numerous methods and implementations for handling online stream data have recently been proposed, but open challenges remain.
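As a minimal illustration of how a measure such as accuracy can be adapted to the online setting, the sketch below uses test-then-train (prequential) evaluation over a sliding window: each example is first used to test the model and only then to update it, and the score is computed over the most recent examples only. The perceptron learner, the synthetic stream, and all names here (OnlinePerceptron, prequential_accuracy, synthetic_stream, window_size) are assumptions chosen for illustration and are not taken from any paper in this Special Issue.

```python
from collections import deque
import random


class OnlinePerceptron:
    """Tiny incremental linear classifier; updates one example at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn_one(self, x, y):
        # Perceptron rule: adjust the weights only when the prediction is wrong.
        error = y - self.predict(x)
        if error != 0:
            self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * error


def prequential_accuracy(stream, model, window_size=200):
    """Test-then-train: predict each example before learning from it.

    Yields the accuracy over the last `window_size` examples, so the
    evaluation state stays bounded however long the stream runs.
    """
    recent = deque(maxlen=window_size)  # old outcomes drop out automatically
    for x, y in stream:
        recent.append(1 if model.predict(x) == y else 0)
        model.learn_one(x, y)
        yield sum(recent) / len(recent)


def synthetic_stream(n, seed=0):
    """Hypothetical stream for the demo: the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, 1 if x[0] + x[1] > 1 else 0


if __name__ == "__main__":
    model = OnlinePerceptron(n_features=2)
    scores = list(prequential_accuracy(synthetic_stream(5000), model))
    print(f"windowed accuracy after {len(scores)} examples: {scores[-1]:.3f}")
```

The window size trades responsiveness against stability: a small window reacts quickly to changes in the stream but gives a noisier estimate, which is one face of the window-size issue mentioned above.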

This Special Issue on Data Mining and Machine Learning Techniques for the Analysis of Stream Data is aimed at industrial and academic researchers proposing and applying non-traditional methods to solve stream data problems in data mining and machine learning. The key areas of this Special Issue include, but are not limited to, online learning, incremental learning, streaming data in data mining, machine learning and artificial intelligence, distributed machine learning and data mining, continuous learning, scalable machine learning, distributed deep learning, and big data streams, along with their application to real-life problems.

Dr. Kichun Lee
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • stream data
  • online data
  • data mining
  • machine learning
  • online learning
  • incremental learning

Published Papers (4 papers)


Research

16 pages, 487 KiB  
Article
Missing Link Prediction Using Non-Overlapped Features and Multiple Sources of Social Networks
by Pokpong Songmuang, Chainarong Sirisup and Aroonwan Suebsriwichai
Information 2021, 12(5), 214; https://doi.org/10.3390/info12050214 - 18 May 2021
Cited by 4 | Viewed by 2425
Abstract
The current methods for missing link prediction in social networks focus on using data from overlapping users from two social network sources to recommend links between unconnected users. To improve the prediction of missing links, this paper presents the use of information from non-overlapping users as additional features in training a prediction model using a machine-learning approach. The proposed features are designed to be used together with the common features as extra features to help tune a better classification model. The social network data sources used in this paper are Twitter and Facebook, where Twitter is the main data source for prediction and Facebook is a supporting data source. For evaluation, comparisons across different machine-learning techniques, feature settings, and network-density levels of the data sources are studied. The experimental results show that the prediction model using a combination of the proposed features and the common features with the Random Forest technique achieved the best performance in terms of the percentage of recovered missing links and F1 score. The model with combined features yields a higher percentage of recovered links, by an average of 23.25%, and a higher F1-measure, by an average of 19.80%, than the multi-social-network-source baseline.

21 pages, 2546 KiB  
Article
Short-Term Electricity Load Forecasting with Machine Learning
by Ernesto Aguilar Madrid and Nuno Antonio
Information 2021, 12(2), 50; https://doi.org/10.3390/info12020050 - 22 Jan 2021
Cited by 57 | Viewed by 8453
Abstract
Accurate short-term load forecasting (STLF) is one of the most critical inputs for power plant units’ planning commitment. STLF reduces the overall planning uncertainty added by the intermittent production of renewable sources; thus, it helps to minimize the hydrothermal electricity production costs in a power grid. Although there is some research in the field and even several research applications, there is a continual need to improve forecasts. This research proposes a set of machine learning (ML) models to improve the accuracy of 168 h forecasts. The developed models employ features from multiple sources, such as historical load, weather, and holidays. Of the five ML models developed and tested in various load profile contexts, the Extreme Gradient Boosting Regressor (XGBoost) algorithm showed the best results, surpassing previous historical weekly predictions based on neural networks. Additionally, because XGBoost models are based on an ensemble of decision trees, they facilitated the model’s interpretation, which provided a relevant additional result: the features’ importance in the forecasting.

26 pages, 1546 KiB  
Article
A Frequent Pattern Conjunction Heuristic for Rule Generation in Data Streams
by Frederic Stahl, Thien Le, Atta Badii and Mohamed Medhat Gaber
Information 2021, 12(1), 24; https://doi.org/10.3390/info12010024 - 09 Jan 2021
Cited by 3 | Viewed by 2717
Abstract
This paper introduces a new and expressive algorithm for inducing descriptive rule-sets from streaming data in real-time in order to describe frequent patterns explicitly encoded in the stream. Data Stream Mining (DSM) is concerned with the automatic analysis of data streams in real-time. Rapid flows of data challenge the state-of-the-art processing and communication infrastructure, hence the motivation for research and innovation into real-time algorithms that analyse data streams on-the-fly and can automatically adapt to concept drifts. To date, DSM techniques have largely focused on predictive data mining applications that aim to forecast the value of a particular target feature of unseen data instances, answering questions such as whether a credit card transaction is fraudulent or not. A real-time, expressive and descriptive Data Mining technique for streaming data has not been previously established as part of the DSM toolkit. This has motivated the work reported in this paper, which has resulted in developing and validating a Generalised Rule Induction (GRI) tool, thus producing expressive rules as explanations that can be easily understood by human analysts. The expressiveness of decision models in data streams serves the objectives of transparency, underpinning the vision of ‘explainable AI’, and yet is an area of research that has attracted less attention despite being of high practical importance. The algorithm introduced and described in this paper is termed Fast Generalised Rule Induction (FGRI). FGRI is able to induce descriptive rules incrementally for raw data from both categorical and numerical features. FGRI is able to adapt rule-sets to changes in the pattern encoded in the data stream (concept drift) on the fly as new data arrives and can thus be applied continuously in real-time. The paper also provides a theoretical, qualitative and empirical evaluation of FGRI.

17 pages, 4860 KiB  
Article
Identification of Malignancies from Free-Text Histopathology Reports Using a Multi-Model Supervised Machine Learning Approach
by Victor Olago, Mazvita Muchengeti, Elvira Singh and Wenlong C. Chen
Information 2020, 11(9), 455; https://doi.org/10.3390/info11090455 - 21 Sep 2020
Cited by 7 | Viewed by 2890
Abstract
We explored various Machine Learning (ML) models to evaluate how each model performs in the task of classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and Dummy classifier. We started with 60,083 histopathology reports, which reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated the possibility of applying ML techniques in the processing of free-text pathology reports for cancer registries for cancer incidence reporting in a Sub-Saharan Africa setting. This is an important consideration for resource-constrained environments seeking to leverage ML techniques to reduce workloads and improve the timeliness of reporting of cancer statistics.
