Algorithms in Data Classification

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (15 November 2023) | Viewed by 23523

Special Issue Editor


Guest Editor
Department of Informatics and Telecommunications, University of Ioannina, 45110 Ioannina, Greece
Interests: grammatical evolution; artificial intelligence; genetic algorithms; neuro evolution; genetic programming

Special Issue Information

Dear Colleagues,

It is my pleasure to invite you to submit to this Special Issue, “Algorithms in Data Classification”, of the reputable MDPI journal Algorithms. The aim of this Special Issue is to present recent advances in data classification and to showcase applications to real-world problems.

Topics include but are not limited to:

  • Binary classification
  • Multi-class classification
  • Multi-label classification
  • Imbalanced classification
  • Feature selection for classification
  • Probabilistic models for classification
  • Big data classification
  • Text classification
  • Multimedia classification
  • Uncertain data classification
  • Methods used in classification, such as Bayes methods, stochastic gradient descent, K-NN, decision trees, SVM, and neural networks
  • Applications of data classification: sentiment analysis, spam classification, document classification, image classification, etc.

Prof. Dr. Ioannis Tsoulos
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • binary classification
  • multi-label classification
  • decision trees
  • neural networks
  • big data
  • Bayes methods
  • K-NN methods
  • feature selection
  • machine learning
  • supervised learning

Published Papers (13 papers)

Editorial

3 pages, 134 KiB  
Editorial
Special Issue “Algorithms in Data Classification”
by Ioannis G. Tsoulos
Algorithms 2024, 17(1), 5; https://doi.org/10.3390/a17010005 - 22 Dec 2023
Viewed by 1254
Abstract
Data classification is a well-known procedure, with many applications to real-world problems [...] Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

Research

14 pages, 1085 KiB  
Article
On the Influence of Data Imbalance on Supervised Gaussian Mixture Models
by Luca Scrucca
Algorithms 2023, 16(12), 563; https://doi.org/10.3390/a16120563 - 11 Dec 2023
Viewed by 1533
Abstract
Imbalanced data present a pervasive challenge in many real-world applications of statistical and machine learning, where the instances of one class significantly outnumber those of the other. This paper examines the impact of class imbalance on the performance of Gaussian mixture models in classification tasks and establishes the need for a strategy to reduce the adverse effects of imbalanced data on the accuracy and reliability of classification outcomes. We explore various strategies to address this problem, including cost-sensitive learning, threshold adjustments, and sampling-based techniques. Through extensive experiments on synthetic and real-world datasets, we evaluate the effectiveness of these methods. Our findings emphasize the need for effective mitigation strategies for class imbalance in supervised Gaussian mixtures, offering valuable insights for practitioners and researchers in improving classification outcomes. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
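
One of the strategies named in the abstract, threshold adjustment, can be illustrated with a minimal sketch. This is not the author's procedure: the function names, the toy scores, and the choice of balanced accuracy as the selection criterion are all assumptions made for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = minority class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def best_threshold(y_true, scores):
    """Scan each observed score as a cut-off; keep the one with the best balanced accuracy."""
    best_t, best_ba = 0.5, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        ba = balanced_accuracy(y_true, y_pred)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba


# Made-up data: 8 majority cases (0) and 2 minority cases (1).
# The scores stand in for the posterior probability of class 1 from any classifier.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.45, 0.30, 0.40]

default_pred = [1 if s >= 0.5 else 0 for s in scores]
default_ba = balanced_accuracy(y_true, default_pred)  # every case predicted as majority
threshold, ba = best_threshold(y_true, scores)
```

With the default cut-off of 0.5 every toy case is assigned to the majority class, whereas the lowered cut-off recovers both minority cases at the cost of one false positive.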

25 pages, 9322 KiB  
Article
Blood Cell Revolution: Unveiling 11 Distinct Types with ‘Naturalize’ Augmentation
by Mohamad Abou Ali, Fadi Dornaika and Ignacio Arganda-Carreras
Algorithms 2023, 16(12), 562; https://doi.org/10.3390/a16120562 - 10 Dec 2023
Cited by 1 | Viewed by 1799
Abstract
Artificial intelligence (AI) has emerged as a cutting-edge tool, simultaneously accelerating, securing, and enhancing the diagnosis and treatment of patients. An exemplification of this capability is evident in the analysis of peripheral blood smears (PBS). In university medical centers, hematologists routinely examine hundreds of PBS slides daily to validate or correct outcomes produced by advanced hematology analyzers assessing samples from potentially problematic patients. This process may logically lead to erroneous PBC readings, posing risks to patient health. AI functions as a transformative tool, significantly improving the accuracy and precision of readings and diagnoses. This study reshapes the parameters of blood cell classification, harnessing the capabilities of AI and broadening the scope from 5 to 11 specific blood cell categories with the challenging 11-class PBC dataset. This transformation facilitates a more profound exploration of blood cell diversity, surpassing prior constraints in medical image analysis. Our approach combines state-of-the-art deep learning techniques, including pre-trained ConvNets, ViTb16 models, and custom CNN architectures. We employ transfer learning, fine-tuning, and ensemble strategies, such as CBAM and Averaging ensembles, to achieve unprecedented accuracy and interpretability. Our fully fine-tuned EfficientNetV2 B0 model sets a new standard, with a macro-average precision, recall, and F1-score of 91%, 90%, and 90%, respectively, and an average accuracy of 93%. This breakthrough underscores the transformative potential of 11-class blood cell classification for more precise medical diagnoses. Moreover, our groundbreaking “Naturalize” augmentation technique produces remarkable results. The 2K-PBC dataset generated with “Naturalize” boasts a macro-average precision, recall, and F1-score of 97%, along with an average accuracy of 96% when leveraging the fully fine-tuned EfficientNetV2 B0 model. 
This innovation not only elevates classification performance but also addresses data scarcity and bias in medical deep learning. Our research marks a paradigm shift in blood cell classification, enabling more nuanced and insightful medical analyses. The “Naturalize” technique’s impact extends beyond blood cell classification, emphasizing the vital role of diverse and comprehensive datasets in advancing healthcare applications through deep learning. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
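
The macro-averaged precision, recall, and F1-score quoted in the abstract weight every class equally, regardless of how many samples it has. A minimal sketch of that computation follows; the function name `macro_scores` and the three-class toy labels are hypothetical and unrelated to the PBC dataset.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    # Unweighted mean over classes: small classes count as much as large ones.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels for three hypothetical classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Because the mean is taken over classes rather than samples, a rare class that is classified poorly lowers the macro scores as much as a common one would.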

21 pages, 4889 KiB  
Article
A Case-Study Comparison of Machine Learning Approaches for Predicting Student’s Dropout from Multiple Online Educational Entities
by José Manuel Porras, Juan Alfonso Lara, Cristóbal Romero and Sebastián Ventura
Algorithms 2023, 16(12), 554; https://doi.org/10.3390/a16120554 - 3 Dec 2023
Viewed by 1672
Abstract
Predicting student dropout is a crucial task in online education. Traditionally, each educational entity (institution, university, faculty, department, etc.) creates and uses its own prediction model starting from its own data. However, that approach is not always feasible or advisable and may depend on the availability of data, local infrastructure, and resources. In those cases, there are various machine learning approaches for sharing data and/or models between educational entities, using a classical centralized machine learning approach or other more advanced approaches such as transfer learning or federated learning. In this paper, we used data from three different LMS Moodle servers representing homogeneous different-sized educational entities. We tested the performance of the different machine learning approaches for the problem of predicting student dropout with multiple educational entities involved. We used a deep learning algorithm as a predictive classifier method. Our preliminary findings provide useful information on the benefits and drawbacks of each approach, as well as suggestions for enhancing performance when there are multiple institutions. In our case, repurposed transfer learning, stacked transfer learning, and centralized approaches produced similar or better results than the locally trained models for most of the entities. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

18 pages, 4392 KiB  
Article
Assessing Algorithms Used for Constructing Confidence Ellipses in Multidimensional Scaling Solutions
by Panos Nikitas and Efthymia Nikita
Algorithms 2023, 16(12), 535; https://doi.org/10.3390/a16120535 - 24 Nov 2023
Viewed by 1200
Abstract
This paper assesses algorithms proposed for constructing confidence ellipses in multidimensional scaling (MDS) solutions and proposes a new approach to interpreting these confidence ellipses via hierarchical cluster analysis (HCA). It is shown that the most effective algorithm for constructing confidence ellipses involves the generation of simulated distances based on the original multivariate dataset and then the creation of MDS maps that are scaled, reflected, rotated, translated, and finally superimposed. For this algorithm, the stability measure of the average areas tends to zero with increasing sample size n following the power model, $An^{-B}$, with positive B values ranging from 0.7 to 2 and high R-squared fitting values around 0.99. This algorithm was applied to create confidence ellipses in the MDS plots of squared Euclidean and Mahalanobis distances for continuous and binary data. It was found that plotting confidence ellipses in MDS plots offers a better visualization of the distance map of the populations under study compared to plotting single points. However, the confidence ellipses cannot eliminate the subjective selection of clusters in the MDS plot based simply on the proximity of the MDS points. To overcome this subjective selection, we should quantify the formation of clusters of proximal samples. Thus, in addition to the algorithm assessment, we propose a new approach that estimates all possible cluster probabilities associated with the confidence ellipses by applying HCA using distance matrices derived from these ellipses. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
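
The power model in the abstract, in which a stability measure decays as A times n to the power of minus B, becomes a straight line after taking logarithms, so A and B can be estimated by ordinary least squares on log-transformed values. The sketch below assumes that fitting approach, which may differ from the one used in the paper; the function name `fit_power_law` and the synthetic data are hypothetical.

```python
import math


def fit_power_law(ns, values):
    """Fit values ~ A * n**(-B) via least squares on log(values) = log(A) - B*log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (A, B)


# Synthetic, noise-free data generated with A = 3.0 and B = 1.2 (made-up values).
ns = [10, 20, 50, 100, 200]
values = [3.0 * n ** -1.2 for n in ns]
A, B = fit_power_law(ns, values)
```

On noise-free data the fit recovers the generating parameters; with real measurements the R-squared of the log-log regression indicates how well the power model holds.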

17 pages, 8845 KiB  
Article
Utilizing Mixture Regression Models for Clustering Time-Series Energy Consumption of a Plastic Injection Molding Process
by Massimo Pacella, Matteo Mangini and Gabriele Papadia
Algorithms 2023, 16(11), 524; https://doi.org/10.3390/a16110524 - 15 Nov 2023
Cited by 1 | Viewed by 1237
Abstract
Considering the issue of energy consumption reduction in industrial plants, we investigated a clustering method for mining the time-series data related to energy consumption. The industrial case study considered in our work is one of the most energy-intensive processes in the plastics industry: the plastic injection molding process. Concerning the industrial setting, the energy consumption of the injection molding machine was monitored across multiple injection molding cycles. The collected data were then analyzed to establish patterns and trends in the energy consumption of the injection molding process. To this end, we considered mixtures of regression models given their flexibility in modeling heterogeneous time series and clustering time series in an unsupervised machine learning framework. Given the assumption of autocorrelated data and exogenous variables in the mixture model, we implemented an algorithm for model fitting that combined autocorrelated observations with spline and polynomial regressions. Our results demonstrate an accurate grouping of energy-consumption profiles, where each cluster is related to a specific production schedule. The clustering method also provides a unique profile of energy consumption for each cluster, depending on the production schedule and regression approach (i.e., spline and polynomial). According to these profiles, information related to the shape of energy consumption was identified, providing insights into reducing the electrical demand of the plant. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

15 pages, 936 KiB  
Article
An Intelligent Injury Rehabilitation Guidance System for Recreational Runners Using Data Mining Algorithms
by Theodoros Tzelepis, George Matlis, Nikos Dimokas, Petros Karvelis, Paraskevi Malliou and Anastasia Beneka
Algorithms 2023, 16(11), 523; https://doi.org/10.3390/a16110523 - 15 Nov 2023
Viewed by 1102
Abstract
In recent years the number of people who exercise every day has increased dramatically. More precisely, due to the COVID period many people have become recreational runners. Recreational running is a regular way to keep active and healthy at any age. Additionally, running is a popular physical exercise that offers numerous health advantages. However, recreational runners report a high incidence of musculoskeletal injuries due to running. The healthcare industry has been compelled to use information technology due to the quick rate of growth and developments in electronic systems, the internet, and telecommunications. Our proposed intelligent system uses data mining algorithms for the rehabilitation guidance of recreational runners with musculoskeletal discomfort. The system classifies recreational runners based on a questionnaire that has been built according to the severity, irritability, nature, stage, and stability model and advises them on the appropriate treatment plan/exercises to follow. Through rigorous testing across various case studies, our method has yielded highly promising results, underscoring its potential to significantly contribute to the well-being and rehabilitation of recreational runners facing musculoskeletal challenges. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

16 pages, 3965 KiB  
Article
Grammatical Evolution-Driven Algorithm for Efficient and Automatic Hyperparameter Optimisation of Neural Networks
by Gauri Vaidya, Meghana Kshirsagar and Conor Ryan
Algorithms 2023, 16(7), 319; https://doi.org/10.3390/a16070319 - 29 Jun 2023
Viewed by 1935
Abstract
Neural networks have revolutionised the way we approach problem solving across multiple domains; however, their effective design and efficient use of computational resources is still a challenging task. One of the most important factors influencing this process is model hyperparameters, which vary significantly with models and datasets. Recently, there has been an increased focus on automatically tuning these hyperparameters to reduce complexity and to optimise resource utilisation. From traditional human-intuitive tuning methods to random search, grid search, Bayesian optimisation, and evolutionary algorithms, significant advancements have been made in this direction that promise improved performance while using fewer resources. In this article, we propose HyperGE, a two-stage model for automatically tuning hyperparameters driven by grammatical evolution (GE), a bioinspired population-based machine learning algorithm. GE provides an advantage in that it allows users to define their own grammar for generating solutions, making it ideal for defining search spaces across datasets and models. We test HyperGE to fine-tune VGG-19 and ResNet-50 pre-trained networks using three benchmark datasets. We demonstrate that the search space is significantly reduced by a factor of ~90% in Stage 2 with fewer trials. HyperGE could become an invaluable tool within the deep learning community, allowing practitioners greater freedom when exploring complex problem domains for hyperparameter fine-tuning. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
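
The user-defined grammar mentioned in the abstract is what turns an integer genome into a candidate solution in grammatical evolution. The sketch below shows the usual codon-modulo mapping on a tiny hyperparameter grammar. It is not HyperGE itself: the grammar, the function name `ge_map`, and the convention of consuming a codon only when a rule offers more than one production are assumptions, and implementations differ on that last point.

```python
# Hypothetical grammar: each non-terminal maps to a list of productions,
# and each production is a list of terminals and non-terminals.
GRAMMAR = {
    "<config>": [["lr=", "<lr>", " batch=", "<batch>"]],
    "<lr>": [["0.1"], ["0.01"], ["0.001"]],
    "<batch>": [["16"], ["32"], ["64"], ["128"]],
}


def ge_map(codons, grammar, start="<config>", max_wraps=2):
    """Expand the leftmost non-terminal first, picking production codon % n_choices."""
    symbols = [start]
    out = []
    used = 0
    limit = len(codons) * (max_wraps + 1)  # genome may be re-read max_wraps times
    while symbols:
        sym = symbols.pop(0)
        if sym not in grammar:
            out.append(sym)  # terminal symbol: emit as-is
            continue
        rules = grammar[sym]
        if len(rules) == 1:
            choice = rules[0]  # no decision needed, so no codon is consumed
        else:
            if used >= limit:
                return None  # mapping failed: genome exhausted
            choice = rules[codons[used % len(codons)] % len(rules)]
            used += 1
        symbols = list(choice) + symbols
    return "".join(out)


phenotype = ge_map([7, 10], GRAMMAR)   # 7 % 3 = 1 selects "0.01", 10 % 4 = 2 selects "64"
wrapped = ge_map([7], GRAMMAR)         # the single codon is reused for the second choice
```

Changing the grammar changes the search space without touching the evolutionary operators, which act only on the integer codons.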

16 pages, 3361 KiB  
Article
Distributed Fuzzy Cognitive Maps for Feature Selection in Big Data Classification
by K. Haritha, M. V. Judy, Konstantinos Papageorgiou, Vassilis C. Georgiannis and Elpiniki Papageorgiou
Algorithms 2022, 15(10), 383; https://doi.org/10.3390/a15100383 - 19 Oct 2022
Cited by 4 | Viewed by 1718
Abstract
The features of a dataset play an important role in the construction of a machine learning model. Because big datasets often have a large number of features, they may contain features that are less relevant to the machine learning task, which makes the process more time-consuming and complex. In order to facilitate learning, it is always recommended to remove the less significant features. The process of eliminating the irrelevant features and finding an optimal feature set involves comprehensively searching the dataset and considering every subset in the data. In this research, we present a distributed wrapper method for feature selection, based on fuzzy cognitive map learning, that is able to extract those features from a dataset that play the most significant role in decision making. Fuzzy cognitive maps (FCMs) represent a hybrid computing technique combining elements of both fuzzy logic and cognitive maps. Using Spark’s resilient distributed datasets (RDDs), the proposed model can work effectively in a distributed manner for quick, in-memory processing along with effective iterative computations. According to the experimental results, when the proposed model is applied to a classification task, the features selected by the model help to expedite the classification process. The selection of relevant features using the proposed algorithm is on par with existing feature selection algorithms. In conjunction with a random forest classifier, the proposed model produced an average accuracy above 90%, as opposed to 85.6% accuracy when no feature selection strategy was adopted. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

28 pages, 6356 KiB  
Article
A Novel Adaptive FCM with Cooperative Multi-Population Differential Evolution Optimization
by Amit Banerjee and Issam Abu-Mahfouz
Algorithms 2022, 15(10), 380; https://doi.org/10.3390/a15100380 - 17 Oct 2022
Cited by 1 | Viewed by 1372
Abstract
Fuzzy c-means (FCM), the fuzzy variant of the popular k-means, has been used for data clustering when cluster boundaries are not well defined. The choice of initial cluster prototypes (or the initialization of cluster memberships), and the fact that the number of clusters needs to be defined a priori are two major factors that can affect the performance of FCM. In this paper, we review algorithms and methods used to overcome these two specific drawbacks. We propose a new cooperative multi-population differential evolution method with elitism to identify near-optimal initial cluster prototypes and also determine the most optimal number of clusters in the data. The differential evolution populations use a smaller subset of the dataset, one that captures the same structure of the dataset. We compare the proposed methodology to newer methods proposed in the literature, with simulations performed on standard benchmark data from the UCI machine learning repository. Finally, we present a case study for clustering time-series patterns from sensor data related to real-time machine health monitoring using the proposed method. Simulation results are promising and show that the proposed methodology can be effective in clustering a wide range of datasets. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
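
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the standard fuzzy c-means iteration, with the number of clusters and the initial prototypes fixed by hand. It does not implement the proposed differential evolution method; the function names, the fuzzifier m = 2, and the toy points are assumptions.

```python
import math


def memberships(points, centers, m=2.0):
    """Standard FCM update: u[j][i] is the membership of point j in cluster i."""
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # Point coincides with a prototype: assign it crisply to avoid dividing by zero.
            row = [1.0 if di == 0.0 else 0.0 for di in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((di / dk) ** (2.0 / (m - 1.0)) for dk in d) for di in d]
        u.append(row)
    return u


def update_centers(points, u, m=2.0):
    """Each prototype is the mean of all points weighted by membership**m."""
    dim = len(points[0])
    centers = []
    for i in range(len(u[0])):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append(
            tuple(sum(wj * x[k] for wj, x in zip(w, points)) / total for k in range(dim))
        )
    return centers


# Made-up 2-D data with two visible groups; initial prototypes chosen by hand.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centers = [(0.0, 0.0), (6.0, 6.0)]
for _ in range(20):
    u = memberships(points, centers)
    centers = update_centers(points, u)
u = memberships(points, centers)
```

The result depends on the hand-picked starting prototypes and cluster count, which are exactly the two choices the abstract identifies as drawbacks of plain FCM.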

19 pages, 1297 KiB  
Article
Detection and Classification of Unannounced Physical Activities and Acute Psychological Stress Events for Interventions in Diabetes Treatment
by Mohammad Reza Askari, Mahmoud Abdel-Latif, Mudassir Rashid, Mert Sevil and Ali Cinar
Algorithms 2022, 15(10), 352; https://doi.org/10.3390/a15100352 - 27 Sep 2022
Cited by 10 | Viewed by 1952
Abstract
Detection and classification of acute psychological stress (APS) and physical activity (PA) in daily lives of people with chronic diseases can provide precision medicine for the treatment of chronic conditions such as diabetes. This study investigates the classification of different types of APS and PA, along with their concurrent occurrences, using the same subset of feature maps via physiological variables measured by a wristband device. Random convolutional kernel transformation is used to extract a large number of feature maps from the biosignals measured by a wristband device (blood volume pulse, galvanic skin response, skin temperature, and 3D accelerometer signals). Three different feature selection techniques (principal component analysis, partial least squares–discriminant analysis (PLS-DA), and sequential forward selection) as well as four approaches for addressing imbalanced sizes of classes (upsampling, downsampling, adaptive synthetic sampling (ADASYN), and weighted training) are evaluated for maximizing detection and classification accuracy. A long short-term memory recurrent neural network model is trained to estimate PA (sedentary state, treadmill run, stationary bike) and APS (non-stress, emotional anxiety stress, mental stress) from wristband signals. The balanced accuracy scores for various combinations of data balancing and feature selection techniques range between 96.82% and 99.99%. The combination of PLS-DA for feature selection and ADASYN for data balancing provides the best overall performance. The detection and classification of APS and PA types along with their concurrent occurrences can provide precision medicine approaches for the treatment of diabetes. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

19 pages, 443 KiB  
Article
QFC: A Parallel Software Tool for Feature Construction, Based on Grammatical Evolution
by Ioannis G. Tsoulos
Algorithms 2022, 15(8), 295; https://doi.org/10.3390/a15080295 - 21 Aug 2022
Cited by 4 | Viewed by 2099
Abstract
This paper presents and analyzes a programming tool that implements a method for classification and function regression problems. This method builds new features from existing ones with the assistance of a hybrid algorithm that makes use of artificial neural networks and grammatical evolution. The implemented software exploits modern multi-core computing units for faster execution. The method has been applied to a variety of classification and function regression problems, and an extensive comparison with other methods of computational intelligence is made. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)

20 pages, 15507 KiB  
Article
Exploring the Efficiencies of Spectral Isolation for Intelligent Wear Monitoring of Micro Drill Bit Automatic Regrinding In-Line Systems
by Ugochukwu Ejike Akpudo and Jang-Wook Hur
Algorithms 2022, 15(6), 194; https://doi.org/10.3390/a15060194 - 6 Jun 2022
Viewed by 2014
Abstract
Despite the increasing digitalization of equipment diagnostic/condition monitoring systems, it remains a challenge to accurately harness discriminant information from multiple sensors with unique spectral (and transient) behaviors. High-precision systems such as the automatic regrinding in-line equipment provide intelligent regrinding of micro drill bits; however, immediate monitoring of the grinder during the grinding process has become necessary because ignoring it directly affects the drill bit’s life and the equipment’s overall utility. Vibration signals from the frame and the high-speed grinding wheels reflect the different health stages of the grinding wheel and can be exploited for intelligent condition monitoring. The spectral isolation technique as a preprocessing tool ensures that only the critical spectral segments of the inputs are retained for improved diagnostic accuracy at reduced computational costs. This study explores artificial intelligence-based models for learning the discriminant spectral information stored in the vibration signals and considers the accuracy and cost implications of spectral isolation of the critical spectral segments of the signals for accurate equipment monitoring. Results from one-dimensional convolutional neural networks (1D-CNN) and multi-layer perceptron (MLP) neural networks, respectively, reveal that spectral isolation offers a higher condition monitoring accuracy at reduced computational costs. Experimental results using different 1D-CNN and MLP architectures reveal 4.6% and 7.5% improved diagnostic accuracy by the 1D-CNNs and MLPs, respectively, at about 1.3% and 5.71% reduced computational costs, respectively. Full article
(This article belongs to the Special Issue Algorithms in Data Classification)
