Machine Learning and Data Mining: Innovations in Big Data Analytics

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information and Communications Technology".

Deadline for manuscript submissions: 30 June 2025 | Viewed by 11833

Special Issue Editors


Dr. Shadi Banitaan
Guest Editor
Electrical & Computer Engineering & Computer Science Department, University of Detroit Mercy, Detroit, MI 48221-9900, USA
Interests: machine learning; data mining; applied artificial intelligence; intelligent systems

Dr. Mina Maleki
Guest Editor Assistant
Electrical & Computer Engineering & Computer Science Department, University of Detroit Mercy, Detroit, MI 48221-9900, USA
Interests: machine learning; data analysis; applied artificial intelligence; bioinformatics

Special Issue Information

Dear Colleagues,

The Special Issue on “Machine Learning and Data Mining: Innovations in Big Data Analytics” aims to explore the latest advancements and applications of machine learning and data mining techniques in the context of big data. As the volume, variety, and velocity of data continue to grow exponentially, there is a pressing need for innovative methods to extract meaningful insights and knowledge from large datasets. This Special Issue will bring together researchers and practitioners to present cutting-edge approaches, share experiences, and discuss future trends in this rapidly evolving field.

Contributions to this Special Issue should address theoretical, methodological, and practical aspects of machine learning and data mining as they relate to big data analytics. We welcome high-quality research papers, comprehensive reviews, and insightful case studies that highlight new challenges, propose novel solutions, and demonstrate successful applications in various domains such as healthcare, finance, social media, and more.

Topics of Interest:

  • Advanced machine learning algorithms for big data;
  • Scalable data mining techniques;
  • Deep learning and its applications in big data analytics;
  • Real-time data processing and analytics;
  • Predictive modeling and forecasting with big data;
  • Anomaly detection and pattern recognition in large datasets;
  • Big data visualization and interpretation;
  • Applications of machine learning and data mining in healthcare, finance, social media, etc.;
  • Ethical and privacy considerations in big data analytics;
  • Tools and frameworks for big data processing.

This Special Issue aims to be a comprehensive resource for those looking to stay at the forefront of machine learning and data mining as applied to big data. By bringing together diverse perspectives and pioneering research, we hope to foster a deeper understanding of the challenges and opportunities in this exciting field.

Dr. Shadi Banitaan
Guest Editor

Dr. Mina Maleki
Guest Editor Assistant

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, authors can submit through the online submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • big data analytics
  • machine learning
  • data mining
  • deep learning
  • scalable algorithms
  • predictive modeling
  • anomaly detection
  • real-time processing
  • data visualization
  • ethical considerations
  • healthcare analytics
  • social media analytics
  • data privacy

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies is available on the journal's website.

Published Papers (9 papers)


Research


27 pages, 10552 KiB  
Article
Enhancing Dongba Pictograph Recognition Using Convolutional Neural Networks and Data Augmentation Techniques
by Shihui Li, Lan Thi Nguyen, Wirapong Chansanam, Natthakan Iam-On and Tossapon Boongoen
Information 2025, 16(5), 362; https://doi.org/10.3390/info16050362 - 29 Apr 2025
Abstract
The recognition of Dongba pictographs presents significant challenges due to the pitfalls of traditional feature extraction methods, the high complexity of classification algorithms, and their limited generalization ability. This study proposes a convolutional neural network (CNN)-based image classification method to enhance the accuracy and efficiency of Dongba pictograph recognition. The research begins with collecting and manually categorizing Dongba pictograph images, followed by preprocessing steps to improve image quality: normalization, grayscale conversion, filtering, denoising, and binarization. The dataset, comprising 70,000 image samples, is categorized into 18 classes based on shape characteristics and manual annotations. A CNN model is then trained on the dataset, which is split into training (70% of the samples), validation (20%), and test (10%) sets. In particular, data augmentation techniques, including rotation, affine transformation, scaling, and translation, are applied to enhance classification accuracy. Experimental results demonstrate that the proposed model achieves a classification accuracy of 99.43% and consistently outperforms other conventional methods, with its performance peaking at 99.84% under optimized training conditions—specifically, with 75 training epochs and a batch size of 512. This study provides a robust and efficient solution for automatically classifying Dongba pictographs, contributing to their digital preservation and scholarly research. By leveraging deep learning techniques, the proposed approach facilitates the rapid and precise identification of Dongba hieroglyphs, supporting ongoing efforts in cultural heritage preservation and the broader application of artificial intelligence in linguistic studies. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
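To make the augmentation recipe concrete, here is a minimal PyTorch sketch, not the authors' code: an 18-class CNN trained with the rotation, affine-transformation, scaling, and translation augmentations the abstract lists. The dataset path, image size, and network layout are illustrative assumptions; the 75 epochs and batch size of 512 echo the reported optimum.

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # pictographs are grayscale/binarized
    transforms.Resize((64, 64)),                  # assumed input size
    transforms.RandomRotation(15),                # rotation augmentation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1), # translation
                            scale=(0.9, 1.1)),    # scaling
    transforms.ToTensor(),
])

# Hypothetical directory layout: one subfolder per pictograph class.
train_set = datasets.ImageFolder("dongba/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=512, shuffle=True)

model = nn.Sequential(                            # small CNN, 18 output classes
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 16 * 16, 18),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(75):                           # peak accuracy reported at 75 epochs
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```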

38 pages, 1737 KiB  
Article
Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators
by Tianyang Fang, Alejandro Perez-Vicente, Hans Johnson and Jafar Saniie
Information 2025, 16(4), 298; https://doi.org/10.3390/info16040298 - 8 Apr 2025
Viewed by 1006
Abstract
This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs, varying in architecture and scheduling techniques. The system demonstrated its capability to parallel-process diverse neural network (NN) models, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four scheduling methods were tested on deep neural network (DNN) workloads: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time, with speedups of up to 90%, although deviations were noted depending on the workload and cluster configuration. This research substantiates FPGAs' utility in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
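To make the pipeline-scheduling idea concrete, the toy model below (an assumption-laden sketch, not the paper's FPGA system) splits a network's layers into contiguous stages, maps each stage to one accelerator, and treats the slowest stage as the steady-state bottleneck; the latencies and speed factors are invented.

```python
from itertools import permutations

layers_ms = [4.0, 9.0, 2.5, 6.0]                       # per-layer latency (ms), invented
accel_speed = {"dla0": 1.0, "dla1": 0.8, "dla2": 1.3}  # relative speed factors, invented

def partitions(seq, k):
    # All ways to split seq into k contiguous, non-empty stages.
    if k == 1:
        yield [seq]
        return
    for i in range(1, len(seq) - k + 2):
        for rest in partitions(seq[i:], k - 1):
            yield [seq[:i]] + rest

best = None
for groups in partitions(layers_ms, len(accel_speed)):
    for order in permutations(accel_speed):
        # Pipeline throughput is limited by its slowest stage.
        bottleneck = max(sum(g) / accel_speed[a] for g, a in zip(groups, order))
        if best is None or bottleneck < best[0]:
            best = (bottleneck, groups, order)

print(f"bottleneck {best[0]:.2f} ms, stages {best[1]}, mapping {best[2]}")
```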

23 pages, 1882 KiB  
Article
Attention Mechanism-Based Cognition-Level Scene Understanding
by Xuejiao Tang and Wenbin Zhang
Information 2025, 16(3), 203; https://doi.org/10.3390/info16030203 - 5 Mar 2025
Viewed by 501
Abstract
Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer with a corresponding rationale, which requires inference abilities based on real-world knowledge. The VCR task, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge, is a cognition-level scene understanding challenge. The VCR task has aroused researchers’ interests due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to solving the VCR task have generally relied on pre-training or exploiting memory with long-term dependency relationship-encoded models. However, these approaches suffer from a lack of generalizability and a loss of information in long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel to enable the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
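A flavor of the parallel attention-based fusion can be given with a minimal PyTorch sketch (not the PAVCR implementation itself): two cross-attention branches run in parallel over text tokens and image-region features, and their pooled outputs are concatenated for answer scoring. All dimensions and the fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class ParallelFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_answers=4):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls = nn.Linear(2 * dim, num_answers)

    def forward(self, text, image):
        # text: (B, T, dim) question tokens; image: (B, R, dim) region features
        t, _ = self.txt2img(text, image, image)   # text attends to image regions
        v, _ = self.img2txt(image, text, text)    # image attends to text tokens
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.cls(fused)                    # logits over candidate answers

model = ParallelFusion()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 36, 256))  # toy batch
```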

33 pages, 3144 KiB  
Article
CNN-Based Optimization for Fish Species Classification: Tackling Environmental Variability, Class Imbalance, and Real-Time Constraints
by Amirhosein Mohammadisabet, Raza Hasan, Vishal Dattana, Salman Mahmood and Saqib Hussain
Information 2025, 16(2), 154; https://doi.org/10.3390/info16020154 - 19 Feb 2025
Viewed by 716
Abstract
Automated fish species classification is essential for marine biodiversity monitoring, fisheries management, and ecological research. However, challenges such as environmental variability, class imbalance, and computational demands hinder the development of robust classification models. This study investigates the effectiveness of convolutional neural network (CNN)-based models and hybrid approaches to address these challenges. Eight CNN architectures, including DenseNet121, MobileNetV2, and Xception, were compared alongside traditional classifiers like support vector machines (SVMs) and random forest. DenseNet121 achieved the highest accuracy (90.2%), leveraging its superior feature extraction and generalization capabilities, while MobileNetV2 balanced accuracy (83.57%) with computational efficiency, processing images in 0.07 s, making it ideal for real-time deployment. Advanced preprocessing techniques, such as data augmentation, turbidity simulation, and transfer learning, were employed to enhance dataset robustness and address class imbalance. Hybrid models combining CNNs with traditional classifiers achieved intermediate accuracy with improved interpretability. Optimization techniques, including pruning and quantization, reduced model size by 73.7%, enabling real-time deployment on resource-constrained devices. Grad-CAM visualizations further enhanced interpretability by identifying key image regions influencing predictions. This study highlights the potential of CNN-based models for scalable, interpretable fish species classification, offering actionable insights for sustainable fisheries management and biodiversity conservation. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
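The general recipe, freezing a pretrained backbone, retraining the head, then compressing the result, can be sketched as follows. This is a hedged illustration rather than the study's pipeline: the species count, the pruning ratio, and the use of dynamic quantization on linear layers are all assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import prune
from torchvision import models

num_species = 23                                   # illustrative class count
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
for p in model.features.parameters():
    p.requires_grad = False                        # freeze the pretrained backbone
model.classifier[1] = nn.Linear(model.last_channel, num_species)  # new head

# ... fine-tune the head on the fish dataset here ...

# Pruning: zero out the 50% smallest-magnitude weights of the new head.
prune.l1_unstructured(model.classifier[1], name="weight", amount=0.5)
prune.remove(model.classifier[1], "weight")        # make the sparsity permanent

# Post-training dynamic quantization of linear layers shrinks the model further.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```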

18 pages, 1031 KiB  
Article
An Ensemble Framework for Text Classification
by Eleni Kamateri and Michail Salampasis
Information 2025, 16(2), 85; https://doi.org/10.3390/info16020085 - 23 Jan 2025
Cited by 1 | Viewed by 867
Abstract
Ensemble learning can improve predictive performance compared to the performance of any of its constituents alone, while keeping computational demands manageable. However, no reference methodology is available for developing ensemble systems. In this paper, we adapt an ensemble framework for patent classification to assist data scientists in creating flexible ensemble architectures for text classification by selecting a finite set of constituent base models from the many available alternatives. We analyze the axes along which someone can select base models of an ensemble system and propose a methodology for combining them. Moreover, we conduct experiments to compare the effectiveness of ensemble systems against base models and state-of-the-art methods on multiple datasets (three patent classification and two text classification datasets), including long and short texts and single- and/or multi-labeled texts. The results verify the generality of our framework and the effectiveness of ensemble systems, especially ensembles of classifiers trained on different data sections/metadata. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
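One axis of the framework, combining base models by soft voting, can be illustrated with the small sketch below. It is not the authors' framework: in their setting each base model would typically be trained on a different text section or metadata field, whereas here both see the same toy strings.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["neural network for image coding", "catalyst for polymer synthesis",
         "transformer model for translation", "alloy composition for turbines"]
labels = [0, 1, 0, 1]                    # toy classes: 0 = computing, 1 = materials

base_lr = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
base_nb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())

ensemble = VotingClassifier(
    estimators=[("lr", base_lr), ("nb", base_nb)],
    voting="soft",                       # average the predicted class probabilities
)
ensemble.fit(texts, labels)
print(ensemble.predict(["graph neural network for retrieval"]))
```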

22 pages, 627 KiB  
Article
Fitness Approximation Through Machine Learning with Dynamic Adaptation to the Evolutionary State
by Itai Tzruia, Tomer Halperin, Moshe Sipper and Achiya Elyasaf
Information 2024, 15(12), 744; https://doi.org/10.3390/info15120744 - 21 Nov 2024
Viewed by 1384
Abstract
We present a novel approach to performing fitness approximation in genetic algorithms (GAs) using machine learning (ML) models, focusing on dynamic adaptation to the evolutionary state. We compare different methods for (1) switching between actual and approximate fitness, (2) sampling the population, and (3) weighting the samples. Experimental findings demonstrate significant improvement in evolutionary runtimes, with fitness scores that are either identical or slightly lower than those of the fully run GA—depending on the ratio of approximate-to-actual-fitness computation. Although we focus on evolutionary agents in Gymnasium (game) simulators—where fitness computation is costly—our approach is generic and can be easily applied to many different domains. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
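The core mechanic, substituting a learned regressor for the expensive fitness on a fraction of evaluations, can be sketched generically as below. This is an assumption-laden toy (random forest surrogate, fixed switch point and 70% approximation ratio, mutation-only GA), not the paper's dynamically adaptive method.

```python
import random
from sklearn.ensemble import RandomForestRegressor

def true_fitness(ind):                   # stand-in for a costly simulation
    return -sum((x - 0.5) ** 2 for x in ind)

pop = [[random.random() for _ in range(8)] for _ in range(40)]
archive_X, archive_y = [], []
surrogate = RandomForestRegressor(n_estimators=50)

for gen in range(30):
    use_model = gen >= 5                 # switch to the surrogate once it has data
    scores = []
    for ind in pop:
        if use_model and random.random() < 0.7:        # ~70% approximate evaluations
            scores.append(surrogate.predict([ind])[0])
        else:
            s = true_fitness(ind)                      # actual (expensive) fitness
            archive_X.append(ind); archive_y.append(s)
            scores.append(s)
    surrogate.fit(archive_X, archive_y)  # retrain on all actual evaluations so far
    ranked = [ind for _, ind in sorted(zip(scores, pop), reverse=True)]
    parents = ranked[: len(pop) // 2]    # truncation selection
    pop = [[min(1.0, max(0.0, x + random.gauss(0, 0.05)))   # Gaussian mutation
            for x in random.choice(parents)] for _ in range(len(pop))]
```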

25 pages, 4208 KiB  
Article
Adaptive and Scalable Database Management with Machine Learning Integration: A PostgreSQL Case Study
by Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva and Pedro Martins
Information 2024, 15(9), 574; https://doi.org/10.3390/info15090574 - 18 Sep 2024
Cited by 1 | Viewed by 4170
Abstract
The increasing complexity of managing modern database systems, particularly in terms of optimizing query performance for large datasets, presents significant challenges that traditional methods often fail to address. This paper proposes a comprehensive framework for integrating advanced machine learning (ML) models within the architecture of a database management system (DBMS), with a specific focus on PostgreSQL. Our approach leverages a combination of supervised and unsupervised learning techniques to predict query execution times, optimize performance, and dynamically manage workloads. Unlike existing solutions that address specific optimization tasks in isolation, our framework provides a unified platform that supports real-time model inference and automatic database configuration adjustments based on workload patterns. A key contribution of our work is the integration of ML capabilities directly into the DBMS engine, enabling seamless interaction between the ML models and the query optimization process. This integration allows for the automatic retraining of models and dynamic workload management, resulting in substantial improvements in both query response times and overall system throughput. Our evaluations using the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark dataset at scale factors of 100 GB, 1 TB, and 10 TB demonstrate a reduction of up to 42% in query execution times and a 74% improvement in throughput compared with traditional approaches. Additionally, we address challenges such as potential conflicts in tuning recommendations and the performance overhead associated with ML integration, providing insights for future research directions. This study is motivated by the need for autonomous tuning mechanisms to manage large-scale, heterogeneous workloads while answering key research questions, such as the following: (1) How can machine learning models be integrated into a DBMS to improve query optimization and workload management? (2) What performance improvements can be achieved through dynamic configuration tuning based on real-time workload patterns? Our results suggest that the proposed framework significantly reduces the need for manual database administration while effectively adapting to evolving workloads, offering a robust solution for modern large-scale data environments. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
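A stripped-down sketch of one ingredient, a supervised model that predicts query execution time from plan-level features, is shown below. The feature set, data, and model choice are illustrative assumptions; the paper goes further and embeds such models inside the PostgreSQL engine with automatic retraining.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical features per query, e.g. derived from EXPLAIN output:
# [estimated rows, join count, sequential scan count, work_mem in MB]
X = np.array([[1e4, 1, 0, 64], [5e6, 3, 2, 64], [2e5, 2, 1, 128],
              [8e6, 4, 3, 256], [3e3, 0, 0, 64], [9e5, 2, 0, 128]])
y = np.array([0.05, 12.4, 0.9, 31.0, 0.01, 2.2])    # observed runtimes in seconds

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, np.log1p(y_tr))  # log-scale target
print(np.expm1(model.predict(X_te)))                # predictions back in seconds
```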

26 pages, 15092 KiB  
Article
Exploring the Depths of the Autocorrelation Function: Its Departure from Normality
by Hossein Hassani, Manuela Royer-Carenzi, Leila Marvian Mashhad, Masoud Yarmohammadi and Mohammad Reza Yeganegi
Information 2024, 15(8), 449; https://doi.org/10.3390/info15080449 - 30 Jul 2024
Cited by 5 | Viewed by 2239
Abstract
In this article, we study the autocorrelation function (ACF), which is a crucial element in time series analysis. We compare the distribution of the ACF, both from a theoretical and an empirical point of view. We focus on white noise processes (WN), i.e., uncorrelated, centered, and identically distributed variables, whose sample ACFs are supposed to be asymptotically independent and converge towards the same normal distribution. However, the study of the sum of the sample ACF contradicts this property. Thus, our findings reveal a deviation of the sample ACF from normality beyond a specific lag. Note that this phenomenon is observed for white noise of varying lengths, and even for the residuals of an ARMA(p,q) model. This discovery challenges traditional assumptions of normality in time series modeling. Indeed, when modeling a time series, the crucial step is to validate the estimated model by checking that the associated residuals form white noise. In this study, we show that the widely used portmanteau tests are not completely accurate. Box–Pierce appears to be too conservative, whereas Ljung–Box is too liberal. We suggest an alternative method based on the ACF for establishing the reliability of the portmanteau test and the validity of the estimated model. We illustrate our methodology using money stock data in the USA. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
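The quantities at issue are easy to reproduce; the sketch below uses simulated white noise (not the article's money stock data) to compute the sample ACF and both portmanteau statistics with statsmodels.

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
wn = rng.standard_normal(500)            # white noise of length n = 500

sample_acf = acf(wn, nlags=20)           # rho_hat(0..20); theory says ~N(0, 1/n)
print("sum of sample ACF, lags 1-20:", sample_acf[1:].sum())

# Ljung-Box and Box-Pierce statistics at lags 10 and 20.
print(acorr_ljungbox(wn, lags=[10, 20], boxpierce=True))
```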

Review


35 pages, 644 KiB  
Review
Machine Learning in Baseball Analytics: Sabermetrics and Beyond
by Wenbing Zhao, Vyaghri Seetharamayya Akella, Shunkun Yang and Xiong Luo
Information 2025, 16(5), 361; https://doi.org/10.3390/info16050361 - 29 Apr 2025
Abstract
In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) What machine learning techniques have been employed in these studies, and how? The findings of these research questions lead to several research contributions. First, we provide a taxonomy for baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular machine learning models are random forest and support vector machine. Furthermore, only a small fraction of studies have rigorously followed best practices in data preprocessing, machine learning model training, testing, and prediction outcome interpretation. Full article
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)
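As a toy illustration of the review's most common model family, the sketch below applies a random forest to a hypothetical game-outcome task; the sabermetric feature columns, data, and labels are entirely invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented per-game features: [home OPS, away OPS, home starter ERA, away starter ERA]
X = np.array([[0.780, 0.710, 3.2, 4.5], [0.695, 0.740, 4.8, 3.9],
              [0.820, 0.760, 2.9, 4.1], [0.700, 0.705, 4.0, 4.2],
              [0.760, 0.690, 3.5, 5.0], [0.680, 0.770, 5.1, 3.3]])
y = np.array([1, 0, 1, 0, 1, 0])         # 1 = home win

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=3))  # the rigorous evaluation the review calls for
```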
