Data Mining & Machine Learning Techniques for the Analysis of Stream Data

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Processes".

Deadline for manuscript submissions: closed (21 December 2023) | Viewed by 25706

Special Issue Editor

Dr. Kichun Lee

Guest Editor

Industrial Engineering, Hanyang University, Seoul 04763, Republic of Korea
Interests: data mining; machine learning; online learning; big data analysis; time series analysis

Special Issue Information

Dear Colleagues,

Stream data are now generated and stored in near real time in a variety of domains, such as sensor networks, e-commerce, financial systems, and social networks, to name a few. The rise and abundance of stream data have increased the need for online or incremental learning in data mining and machine learning, because traditional methods struggle to cope with large volumes of real-time data. Numerous data mining and machine learning methods have found their way into the tasks of regression, classification, recommendation, and outlier detection.

While such methods have been beneficial for offline data, usually through batch learning, applying them to stream data has highlighted a number of shortcomings, such as memory restrictions, the speed of model updates, the inability to guarantee the best model, the choice of window size, and so forth. In addition, high-speed stream data require traditional performance measures, such as accuracy and F1 score, to be adapted to the online setting. Numerous methods and implementations for handling online stream data have recently been proposed, yet open challenges remain.
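
As one illustration of adapting a batch measure to the online setting, the sketch below computes accuracy in a prequential (test-then-train) fashion over a sliding window. It is a minimal example and not a method taken from this Special Issue: the class `MajorityClassLearner`, the function `prequential_accuracy`, and the `window_size` parameter are hypothetical names introduced here for illustration only.

```python
from collections import deque
from typing import Any, Dict, Iterable, List, Tuple


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = {}

    def predict(self, x: Any) -> int:
        if not self.counts:
            return 0  # arbitrary default before any label has been observed
        return max(self.counts, key=self.counts.get)

    def learn(self, x: Any, y: int) -> None:
        # Constant-time update; no past instances are stored.
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(
    stream: Iterable[Tuple[Any, int]],
    learner: MajorityClassLearner,
    window_size: int,
) -> List[float]:
    """Test-then-train: score each prediction before the learner sees the label.

    Returns the accuracy over the last `window_size` instances after each step.
    """
    recent = deque(maxlen=window_size)  # bounded memory: old outcomes fall out
    history: List[float] = []
    for x, y in stream:
        recent.append(learner.predict(x) == y)  # test first
        learner.learn(x, y)                     # then train
        history.append(sum(recent) / len(recent))
    return history
```

Calling `prequential_accuracy(stream, MajorityClassLearner(), window_size=2)` on a stream whose majority label changes shows the windowed accuracy dropping after the change and recovering once the learner catches up. A smaller window reacts faster but gives a noisier estimate, which is one form of the window-size issue noted above.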

This Special Issue on Data Mining and Machine Learning Techniques for the Analysis of Stream Data is aimed at industrial and academic researchers who propose and apply non-traditional methods to solve stream data problems in data mining and machine learning. The key areas of this Special Issue include, but are not limited to, online learning, incremental learning, streaming data in data mining, machine learning and artificial intelligence, distributed machine learning and data mining, continuous learning, scalable machine learning, distributed deep learning, and big data streams, along with their application to real-life problems.

Dr. Kichun Lee
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

stream data
online data
data mining
machine learning
online learning
incremental learning

Benefits of Publishing in a Special Issue

Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (4 papers)

Research

16 pages, 487 KiB

Open Access Article

Missing Link Prediction Using Non-Overlapped Features and Multiple Sources of Social Networks

by Pokpong Songmuang, Chainarong Sirisup and Aroonwan Suebsriwichai

Information 2021, 12(5), 214; https://doi.org/10.3390/info12050214 - 18 May 2021

Cited by 6 | Viewed by 3233

Abstract

Current methods for missing link prediction in social networks focus on using data from overlapping users of two social network sources to recommend links between unconnected users. To improve missing link prediction, this paper presents the use of information from non-overlapping users as additional features in training a prediction model using a machine-learning approach. The proposed features are designed to be used together with the common features, as extra features that help tune a better classification model. The social network data sources used in this paper are Twitter and Facebook, where Twitter is the main data source for prediction and Facebook is the supporting data source. For evaluation, different machine-learning techniques, feature settings, and network-density levels of the data source are compared. The experimental results show that the prediction model combining the proposed features and the common features with the Random Forest technique performed best in terms of the percentage of recovered missing links and F1 score. The combined-feature model improves on the multi-social-network-source baseline by an average of 23.25% in the percentage of recovered links and by an average of 19.80% in F1-measure.

(This article belongs to the Special Issue Data Mining & Machine Learning Techniques for the Analysis of Stream Data)

21 pages, 2546 KiB

Open Access Article

Short-Term Electricity Load Forecasting with Machine Learning

by Ernesto Aguilar Madrid and Nuno Antonio

Information 2021, 12(2), 50; https://doi.org/10.3390/info12020050 - 22 Jan 2021

Cited by 122 | Viewed by 13254

Abstract

Accurate short-term load forecasting (STLF) is one of the most critical inputs for power plant units’ planning commitment. STLF reduces the overall planning uncertainty added by the intermittent production of renewable sources; thus, it helps to minimize the hydrothermal electricity production costs in a power grid. Although there is some research in the field and even several research applications, there is a continual need to improve forecasts. This research proposes a set of machine learning (ML) models to improve the accuracy of 168 h forecasts. The developed models employ features from multiple sources, such as historical load, weather, and holidays. Of the five ML models developed and tested in various load profile contexts, the Extreme Gradient Boosting Regressor (XGBoost) algorithm showed the best results, surpassing previous historical weekly predictions based on neural networks. Additionally, because XGBoost models are based on an ensemble of decision trees, the algorithm facilitated the model’s interpretation, which provided a relevant additional result: the features’ importance in the forecasting.

26 pages, 1546 KiB

Open Access Article

A Frequent Pattern Conjunction Heuristic for Rule Generation in Data Streams

by Frederic Stahl, Thien Le, Atta Badii and Mohamed Medhat Gaber

Information 2021, 12(1), 24; https://doi.org/10.3390/info12010024 - 9 Jan 2021

Cited by 3 | Viewed by 3682

Abstract

This paper introduces a new and expressive algorithm for inducing descriptive rule-sets from streaming data in real-time in order to describe frequent patterns explicitly encoded in the stream. Data Stream Mining (DSM) is concerned with the automatic analysis of data streams in real-time. Rapid flows of data challenge the state-of-the-art processing and communication infrastructure, hence the motivation for research and innovation into real-time algorithms that analyse data streams on-the-fly and can automatically adapt to concept drifts. To date, DSM techniques have largely focused on predictive data mining applications that aim to forecast the value of a particular target feature of unseen data instances, answering questions such as whether a credit card transaction is fraudulent or not. A real-time, expressive and descriptive Data Mining technique for streaming data has not been previously established as part of the DSM toolkit. This has motivated the work reported in this paper, which has resulted in developing and validating a Generalised Rule Induction (GRI) tool, thus producing expressive rules as explanations that can be easily understood by human analysts. The expressiveness of decision models in data streams serves the objectives of transparency, underpinning the vision of ‘explainable AI’ and yet is an area of research that has attracted less attention despite being of high practical importance. The algorithm introduced and described in this paper is termed Fast Generalised Rule Induction (FGRI). FGRI is able to induce descriptive rules incrementally for raw data from both categorical and numerical features. FGRI is able to adapt rule-sets to changes of the pattern encoded in the data stream (concept drift) on the fly as new data arrives and can thus be applied continuously in real-time. The paper also provides a theoretical, qualitative and empirical evaluation of FGRI.

17 pages, 4860 KiB

Open Access Article

Identification of Malignancies from Free-Text Histopathology Reports Using a Multi-Model Supervised Machine Learning Approach

by Victor Olago, Mazvita Muchengeti, Elvira Singh and Wenlong C. Chen

Information 2020, 11(9), 455; https://doi.org/10.3390/info11090455 - 21 Sep 2020

Cited by 7 | Viewed by 3675

Abstract

We explored various Machine Learning (ML) models to evaluate how each model performs in the task of classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and Dummy classifier. We started with 60,083 histopathology reports, which reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated the possibility of applying ML techniques in the processing of free-text pathology reports for cancer registries for cancer incidence reporting in a Sub-Saharan Africa setting. This is an important consideration for the resource-constrained environments to leverage ML techniques to reduce workloads and improve the timeliness of reporting of cancer statistics.
