Machine Learning and Data Mining: Innovations in Big Data Analytics

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information and Communications Technology".

Deadline for manuscript submissions: 30 June 2025 | Viewed by 11833

Special Issue Editors


Dr. Shadi Banitaan
Guest Editor
Electrical & Computer Engineering & Computer Science Department, University of Detroit Mercy, Detroit, MI 48221-9900, USA
Interests: machine learning; data mining; applied artificial intelligence; intelligent systems

Dr. Mina Maleki
Guest Editor Assistant
Electrical & Computer Engineering & Computer Science Department, University of Detroit Mercy, Detroit, MI 48221-9900, USA
Interests: machine learning; data analysis; applied artificial intelligence; bioinformatics

Special Issue Information

Dear Colleagues,

The Special Issue on “Machine Learning and Data Mining: Innovations in Big Data Analytics” aims to explore the latest advancements and applications of machine learning and data mining techniques in the context of big data. As the volume, variety, and velocity of data continue to grow exponentially, there is a pressing need for innovative methods to extract meaningful insights and knowledge from large datasets. This Special Issue will bring together researchers and practitioners to present cutting-edge approaches, share experiences, and discuss future trends in this rapidly evolving field.

Contributions to this Special Issue should address theoretical, methodological, and practical aspects of machine learning and data mining as they relate to big data analytics. We welcome high-quality research papers, comprehensive reviews, and insightful case studies that highlight new challenges, propose novel solutions, and demonstrate successful applications in various domains such as healthcare, finance, social media, and more.

Topics of Interest:

  • Advanced machine learning algorithms for big data;
  • Scalable data mining techniques;
  • Deep learning and its applications in big data analytics;
  • Real-time data processing and analytics;
  • Predictive modeling and forecasting with big data;
  • Anomaly detection and pattern recognition in large datasets;
  • Big data visualization and interpretation;
  • Applications of machine learning and data mining in healthcare, finance, social media, etc.;
  • Ethical and privacy considerations in big data analytics;
  • Tools and frameworks for big data processing.

This Special Issue aims to be a comprehensive resource for those looking to stay at the forefront of machine learning and data mining as applied to big data. By bringing together diverse perspectives and pioneering research, we hope to foster a deeper understanding of the challenges and opportunities in this exciting field.

Dr. Shadi Banitaan
Guest Editor

Dr. Mina Maleki
Guest Editor Assistant

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, authors can submit through the online submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • big data analytics
  • machine learning
  • data mining
  • deep learning
  • scalable algorithms
  • predictive modeling
  • anomaly detection
  • real-time processing
  • data visualization
  • ethical considerations
  • healthcare analytics
  • social media analytics
  • data privacy

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies is available on the journal's website.

Published Papers (9 papers)


Research


27 pages, 10552 KiB  
Article
Enhancing Dongba Pictograph Recognition Using Convolutional Neural Networks and Data Augmentation Techniques
by Shihui Li, Lan Thi Nguyen, Wirapong Chansanam, Natthakan Iam-On and Tossapon Boongoen
Information 2025, 16(5), 362; https://doi.org/10.3390/info16050362 - 29 Apr 2025
Abstract
The recognition of Dongba pictographs presents significant challenges due to the pitfalls of traditional feature extraction methods, the high complexity of classification algorithms, and their limited generalization ability. This study proposes a convolutional neural network (CNN)-based image classification method to enhance the accuracy and efficiency of Dongba pictograph recognition. The research begins with collecting and manually categorizing Dongba pictograph images, followed by preprocessing steps to improve image quality: normalization, grayscale conversion, filtering, denoising, and binarization. The dataset, comprising 70,000 image samples, is categorized into 18 classes based on shape characteristics and manual annotations. A CNN model is then trained on the dataset, which is split into training (70% of the samples), validation (20%), and test (10%) sets. In particular, data augmentation techniques, including rotation, affine transformation, scaling, and translation, are applied to enhance classification accuracy. Experimental results demonstrate that the proposed model achieves a classification accuracy of 99.43% and consistently outperforms other conventional methods, with its performance peaking at 99.84% under optimized training conditions—specifically, with 75 training epochs and a batch size of 512. This study provides a robust and efficient solution for automatically classifying Dongba pictographs, contributing to their digital preservation and scholarly research. By leveraging deep learning techniques, the proposed approach facilitates the rapid and precise identification of Dongba hieroglyphs, supporting ongoing efforts in cultural heritage preservation and the broader application of artificial intelligence in linguistic studies. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
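To make the augmentation recipe concrete, here is a minimal PyTorch sketch, not the authors' code: an 18-class CNN trained with the rotation, affine-transformation, scaling, and translation augmentations the abstract lists. The dataset path, image size, and network layout are illustrative assumptions; the 75 epochs and batch size of 512 echo the reported optimum.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # pictographs are grayscale/binarized
    transforms.Resize((64, 64)),                  # assumed input size
    transforms.RandomRotation(15),                # rotation augmentation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1), # translation
                            scale=(0.9, 1.1)),    # scaling
    transforms.ToTensor(),
])

# Hypothetical directory layout: one subfolder per pictograph class.
train_set = datasets.ImageFolder("dongba/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=512, shuffle=True)

model = nn.Sequential(                            # small CNN, 18 output classes
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 16 * 16, 18),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(75):                           # peak accuracy reported at 75 epochs
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```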

38 pages, 1737 KiB  
Article
Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators
by Tianyang Fang, Alejandro Perez-Vicente, Hans Johnson and Jafar Saniie
Information 2025, 16(4), 298; https://doi.org/10.3390/info16040298 - 8 Apr 2025
Viewed by 1006
Abstract
This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs, varying in architecture and scheduling techniques. The system demonstrated its capability to parallel-process diverse neural network (NN) models, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four scheduling methods were tested on deep neural network (DNN) workloads: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time, with speedups of up to 90%, although deviations were noted depending on the workload and cluster configuration. This research substantiates FPGAs' utility in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
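To make the pipeline-scheduling idea concrete, the toy model below (an assumption-laden sketch, not the paper's FPGA system) splits a network's layers into contiguous stages, maps each stage to one accelerator, and treats the slowest stage as the steady-state bottleneck; the latencies and speed factors are invented.

```python
from itertools import permutations

layers_ms = [4.0, 9.0, 2.5, 6.0]                       # per-layer latency (ms), invented
accel_speed = {"dla0": 1.0, "dla1": 0.8, "dla2": 1.3}  # relative speed factors, invented

def partitions(seq, k):
    # All ways to split seq into k contiguous, non-empty stages.
    if k == 1:
        yield [seq]
        return
    for i in range(1, len(seq) - k + 2):
        for rest in partitions(seq[i:], k - 1):
            yield [seq[:i]] + rest

best = None
for groups in partitions(layers_ms, len(accel_speed)):
    for order in permutations(accel_speed):
        # Pipeline throughput is limited by its slowest stage.
        bottleneck = max(sum(g) / accel_speed[a] for g, a in zip(groups, order))
        if best is None or bottleneck < best[0]:
            best = (bottleneck, groups, order)

print(f"bottleneck {best[0]:.2f} ms, stages {best[1]}, mapping {best[2]}")
```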

23 pages, 1882 KiB  
Article
Attention Mechanism-Based Cognition-Level Scene Understanding
by Xuejiao Tang and Wenbin Zhang
Information 2025, 16(3), 203; https://doi.org/10.3390/info16030203 - 5 Mar 2025
Viewed by 501
Abstract
Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer with a corresponding rationale, which requires inference abilities based on real-world knowledge. The VCR task, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge, is a cognition-level scene understanding challenge. The VCR task has aroused researchers’ interests due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to solving the VCR task have generally relied on pre-training or exploiting memory with long-term dependency relationship-encoded models. However, these approaches suffer from a lack of generalizability and a loss of information in long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel to enable the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
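A flavor of the parallel attention-based fusion can be given with a minimal PyTorch sketch (not the PAVCR implementation itself): two cross-attention branches run in parallel over text tokens and image-region features, and their pooled outputs are concatenated for answer scoring. All dimensions and the fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class ParallelFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_answers=4):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls = nn.Linear(2 * dim, num_answers)

    def forward(self, text, image):
        # text: (B, T, dim) question tokens; image: (B, R, dim) region features
        t, _ = self.txt2img(text, image, image)   # text attends to image regions
        v, _ = self.img2txt(image, text, text)    # image attends to text tokens
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.cls(fused)                    # logits over candidate answers

model = ParallelFusion()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 36, 256))  # toy batch
```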

33 pages, 3144 KiB  
Article
CNN-Based Optimization for Fish Species Classification: Tackling Environmental Variability, Class Imbalance, and Real-Time Constraints
by Amirhosein Mohammadisabet, Raza Hasan, Vishal Dattana, Salman Mahmood and Saqib Hussain
Information 2025, 16(2), 154; https://doi.org/10.3390/info16020154 - 19 Feb 2025
Viewed by 716
Abstract
Automated fish species classification is essential for marine biodiversity monitoring, fisheries management, and ecological research. However, challenges such as environmental variability, class imbalance, and computational demands hinder the development of robust classification models. This study investigates the effectiveness of convolutional neural network (CNN)-based models and hybrid approaches to address these challenges. Eight CNN architectures, including DenseNet121, MobileNetV2, and Xception, were compared alongside traditional classifiers like support vector machines (SVMs) and random forest. DenseNet121 achieved the highest accuracy (90.2%), leveraging its superior feature extraction and generalization capabilities, while MobileNetV2 balanced accuracy (83.57%) with computational efficiency, processing images in 0.07 s, making it ideal for real-time deployment. Advanced preprocessing techniques, such as data augmentation, turbidity simulation, and transfer learning, were employed to enhance dataset robustness and address class imbalance. Hybrid models combining CNNs with traditional classifiers achieved intermediate accuracy with improved interpretability. Optimization techniques, including pruning and quantization, reduced model size by 73.7%, enabling real-time deployment on resource-constrained devices. Grad-CAM visualizations further enhanced interpretability by identifying key image regions influencing predictions. This study highlights the potential of CNN-based models for scalable, interpretable fish species classification, offering actionable insights for sustainable fisheries management and biodiversity conservation. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
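The general recipe, freezing a pretrained backbone, retraining the head, then compressing the result, can be sketched as follows. This is a hedged illustration rather than the study's pipeline: the species count, the pruning ratio, and the use of dynamic quantization on linear layers are all assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import prune
from torchvision import models

num_species = 23                                   # illustrative class count
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
for p in model.features.parameters():
    p.requires_grad = False                        # freeze the pretrained backbone
model.classifier[1] = nn.Linear(model.last_channel, num_species)  # new head

# ... fine-tune the head on the fish dataset here ...

# Pruning: zero out the 50% smallest-magnitude weights of the new head.
prune.l1_unstructured(model.classifier[1], name="weight", amount=0.5)
prune.remove(model.classifier[1], "weight")        # make the sparsity permanent

# Post-training dynamic quantization of linear layers shrinks the model further.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```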

18 pages, 1031 KiB  
Article
An Ensemble Framework for Text Classification
by Eleni Kamateri and Michail Salampasis
Information 2025, 16(2), 85; https://doi.org/10.3390/info16020085 - 23 Jan 2025
Cited by 1 | Viewed by 867
Abstract
Ensemble learning can improve predictive performance compared to the performance of any of its constituents alone, while keeping computational demands manageable. However, no reference methodology is available for developing ensemble systems. In this paper, we adapt an ensemble framework for patent classification to assist data scientists in creating flexible ensemble architectures for text classification by selecting a finite set of constituent base models from the many available alternatives. We analyze the axes along which someone can select base models of an ensemble system and propose a methodology for combining them. Moreover, we conduct experiments to compare the effectiveness of ensemble systems against base models and state-of-the-art methods on multiple datasets (three patent classification and two text classification datasets), including long and short texts and single- and/or multi-labeled texts. The results verify the generality of our framework and the effectiveness of ensemble systems, especially ensembles of classifiers trained on different data sections/metadata. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
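One axis of the framework, combining base models by soft voting, can be illustrated with the small sketch below. It is not the authors' framework: in their setting each base model would typically be trained on a different text section or metadata field, whereas here both see the same toy strings.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["neural network for image coding", "catalyst for polymer synthesis",
         "transformer model for translation", "alloy composition for turbines"]
labels = [0, 1, 0, 1]                    # toy classes: 0 = computing, 1 = materials

base_lr = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
base_nb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())

ensemble = VotingClassifier(
    estimators=[("lr", base_lr), ("nb", base_nb)],
    voting="soft",                       # average the predicted class probabilities
)
ensemble.fit(texts, labels)
print(ensemble.predict(["graph neural network for retrieval"]))
```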

22 pages, 627 KiB  
Article
Fitness Approximation Through Machine Learning with Dynamic Adaptation to the Evolutionary State
by Itai Tzruia, Tomer Halperin, Moshe Sipper and Achiya Elyasaf
Information 2024, 15(12), 744; https://doi.org/10.3390/info15120744 - 21 Nov 2024
Viewed by 1384
Abstract
We present a novel approach to performing fitness approximation in genetic algorithms (GAs) using machine learning (ML) models, focusing on dynamic adaptation to the evolutionary state. We compare different methods for (1) switching between actual and approximate fitness, (2) sampling the population, and (3) weighting the samples. Experimental findings demonstrate significant improvement in evolutionary runtimes, with fitness scores that are either identical or slightly lower than those of the fully run GA—depending on the ratio of approximate-to-actual-fitness computation. Although we focus on evolutionary agents in Gymnasium (game) simulators—where fitness computation is costly—our approach is generic and can be easily applied to many different domains. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
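The core mechanic, substituting a learned regressor for the expensive fitness on a fraction of evaluations, can be sketched generically as below. This is an assumption-laden toy (random forest surrogate, fixed switch point and 70% approximation ratio, mutation-only GA), not the paper's dynamically adaptive method.

```python
import random
from sklearn.ensemble import RandomForestRegressor

def true_fitness(ind):                   # stand-in for a costly simulation
    return -sum((x - 0.5) ** 2 for x in ind)

pop = [[random.random() for _ in range(8)] for _ in range(40)]
archive_X, archive_y = [], []
surrogate = RandomForestRegressor(n_estimators=50)

for gen in range(30):
    use_model = gen >= 5                 # switch to the surrogate once it has data
    scores = []
    for ind in pop:
        if use_model and random.random() < 0.7:        # ~70% approximate evaluations
            scores.append(surrogate.predict([ind])[0])
        else:
            s = true_fitness(ind)                      # actual (expensive) fitness
            archive_X.append(ind); archive_y.append(s)
            scores.append(s)
    surrogate.fit(archive_X, archive_y)  # retrain on all actual evaluations so far
    ranked = [ind for _, ind in sorted(zip(scores, pop), reverse=True)]
    parents = ranked[: len(pop) // 2]    # truncation selection
    pop = [[min(1.0, max(0.0, x + random.gauss(0, 0.05)))   # Gaussian mutation
            for x in random.choice(parents)] for _ in range(len(pop))]
```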

25 pages, 4208 KiB  
Article
Adaptive and Scalable Database Management with Machine Learning Integration: A PostgreSQL Case Study
by Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva and Pedro Martins
Information 2024, 15(9), 574; https://doi.org/10.3390/info15090574 - 18 Sep 2024
Cited by 1 | Viewed by 4170
Abstract
The increasing complexity of managing modern database systems, particularly in terms of optimizing query performance for large datasets, presents significant challenges that traditional methods often fail to address. This paper proposes a comprehensive framework for integrating advanced machine learning (ML) models within the architecture of a database management system (DBMS), with a specific focus on PostgreSQL. Our approach leverages a combination of supervised and unsupervised learning techniques to predict query execution times, optimize performance, and dynamically manage workloads. Unlike existing solutions that address specific optimization tasks in isolation, our framework provides a unified platform that supports real-time model inference and automatic database configuration adjustments based on workload patterns. A key contribution of our work is the integration of ML capabilities directly into the DBMS engine, enabling seamless interaction between the ML models and the query optimization process. This integration allows for the automatic retraining of models and dynamic workload management, resulting in substantial improvements in both query response times and overall system throughput. Our evaluations using the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark dataset at scale factors of 100 GB, 1 TB, and 10 TB demonstrate a reduction of up to 42% in query execution times and a 74% improvement in throughput compared with traditional approaches. Additionally, we address challenges such as potential conflicts in tuning recommendations and the performance overhead associated with ML integration, providing insights for future research directions. This study is motivated by the need for autonomous tuning mechanisms to manage large-scale, heterogeneous workloads while answering key research questions, such as the following: (1) How can machine learning models be integrated into a DBMS to improve query optimization and workload management? (2) What performance improvements can be achieved through dynamic configuration tuning based on real-time workload patterns? Our results suggest that the proposed framework significantly reduces the need for manual database administration while effectively adapting to evolving workloads, offering a robust solution for modern large-scale data environments. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
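A stripped-down sketch of one ingredient, a supervised model that predicts query execution time from plan-level features, is shown below. The feature set, data, and model choice are illustrative assumptions; the paper goes further and embeds such models inside the PostgreSQL engine with automatic retraining.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical features per query, e.g. derived from EXPLAIN output:
# [estimated rows, join count, sequential scan count, work_mem in MB]
X = np.array([[1e4, 1, 0, 64], [5e6, 3, 2, 64], [2e5, 2, 1, 128],
              [8e6, 4, 3, 256], [3e3, 0, 0, 64], [9e5, 2, 0, 128]])
y = np.array([0.05, 12.4, 0.9, 31.0, 0.01, 2.2])    # observed runtimes in seconds

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, np.log1p(y_tr))  # log-scale target
print(np.expm1(model.predict(X_te)))                # predictions back in seconds
```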

26 pages, 15092 KiB  
Article
Exploring the Depths of the Autocorrelation Function: Its Departure from Normality
by Hossein Hassani, Manuela Royer-Carenzi, Leila Marvian Mashhad, Masoud Yarmohammadi and Mohammad Reza Yeganegi
Information 2024, 15(8), 449; https://doi.org/10.3390/info15080449 - 30 Jul 2024
Cited by 5 | Viewed by 2239
Abstract
In this article, we study the autocorrelation function (ACF), which is a crucial element in time series analysis. We compare the distribution of the ACF, both from a theoretical and an empirical point of view. We focus on white noise processes (WN), i.e., uncorrelated, centered, and identically distributed variables, whose sample ACFs are supposed to be asymptotically independent and converge towards the same normal distribution. However, the study of the sum of the sample ACF contradicts this property. Thus, our findings reveal a deviation of the sample ACF from normality beyond a specific lag. Note that this phenomenon is observed for white noise of varying lengths, and even for the residuals of an ARMA(p,q) model. This discovery challenges traditional assumptions of normality in time series modeling. Indeed, when modeling a time series, the crucial step is to validate the estimated model by checking that the associated residuals form white noise. In this study, we show that the widely used portmanteau tests are not completely accurate. Box–Pierce appears to be too conservative, whereas Ljung–Box is too liberal. We suggest an alternative method based on the ACF for establishing the reliability of the portmanteau test and the validity of the estimated model. We illustrate our methodology using money stock data in the USA. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
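The quantities at issue are easy to reproduce; the sketch below uses simulated white noise (not the article's money stock data) to compute the sample ACF and both portmanteau statistics with statsmodels.

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
wn = rng.standard_normal(500)            # white noise of length n = 500

sample_acf = acf(wn, nlags=20)           # rho_hat(0..20); theory says ~N(0, 1/n)
print("sum of sample ACF, lags 1-20:", sample_acf[1:].sum())

# Ljung-Box and Box-Pierce statistics at lags 10 and 20.
print(acorr_ljungbox(wn, lags=[10, 20], boxpierce=True))
```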

Review


35 pages, 644 KiB  
Review
Machine Learning in Baseball Analytics: Sabermetrics and Beyond
by Wenbing Zhao, Vyaghri Seetharamayya Akella, Shunkun Yang and Xiong Luo
Information 2025, 16(5), 361; https://doi.org/10.3390/info16050361 - 29 Apr 2025
Abstract
In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) What machine learning techniques have been employed in these studies, and how? The findings of these research questions lead to several research contributions. First, we provide a taxonomy for baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular machine learning models are random forest and support vector machine. Furthermore, only a small fraction of studies have rigorously followed best practices in data preprocessing, machine learning model training, testing, and prediction outcome interpretation. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
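As a toy illustration of the review's most common model family, the sketch below applies a random forest to a hypothetical game-outcome task; the sabermetric feature columns, data, and labels are entirely invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented per-game features: [home OPS, away OPS, home starter ERA, away starter ERA]
X = np.array([[0.780, 0.710, 3.2, 4.5], [0.695, 0.740, 4.8, 3.9],
              [0.820, 0.760, 2.9, 4.1], [0.700, 0.705, 4.0, 4.2],
              [0.760, 0.690, 3.5, 5.0], [0.680, 0.770, 5.1, 3.3]])
y = np.array([1, 0, 1, 0, 1, 0])         # 1 = home win

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=3))  # the rigorous evaluation the review calls for
```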
