MDPI - Publisher of Open Access Journals

42 pages, 13901 KB

Open Access | Feature Paper | Editor’s Choice | Article

Hybrid Explainable AI for Machine Predictive Maintenance: From Symbolic Expressions to Meta-Ensembles

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Processes 2025, 13(7), 2180; https://doi.org/10.3390/pr13072180 - 8 Jul 2025

Viewed by 2504

Machine predictive maintenance plays a critical role in reducing unplanned downtime, lowering maintenance costs, and improving operational reliability by enabling the early detection and classification of potential failures. Artificial intelligence (AI) enhances these capabilities through advanced algorithms that can analyze complex sensor data with high accuracy and adaptability. This study introduces an explainable AI framework for failure detection and classification using symbolic expressions (SEs) derived from a genetic programming symbolic classifier (GPSC). Due to the imbalanced nature and wide variable ranges in the original dataset, we applied scaling/normalization and oversampling techniques to generate multiple balanced dataset variations. Each variation was used to train the GPSC with five-fold cross-validation, and optimal hyperparameters were selected using a Random Hyperparameter Value Search (RHVS) method. However, as the initial Threshold-Based Voting Ensembles (TBVEs) built from SEs did not achieve a satisfactory performance for all classes, a meta-dataset was developed from the outputs of the obtained SEs. For each class, a meta-dataset was preprocessed, balanced, and used to train a Random Forest Classifier (RFC) with hyperparameter tuning via RandomizedSearchCV. For each class, a TBVE was then constructed from the saved RFC models. The resulting ensemble demonstrated a near-perfect performance for failure detection and classification in most classes (0, 1, 3, and 5), although Classes 2 and 4 achieved a lower performance, which could be attributed to an extremely low number of samples and a hard-to-detect type of failure. Overall, the proposed method presents a robust and explainable AI solution for predictive maintenance, combining symbolic learning with ensemble-based meta-modeling. Full article
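The threshold-based voting ensemble (TBVE) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three symbolic expressions are hypothetical, the sigmoid mapping of raw outputs is an assumption, and the threshold is assumed to be the minimum number of expressions that must vote for the positive class.

```python
import math

def sigmoid(x):
    """Map a raw symbolic-expression output to the interval (0, 1)."""
    # Clamp the input so math.exp cannot overflow on extreme values.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical symbolic expressions; real SEs would come from GPSC training.
def se_1(x):
    return x[0] - 0.5 * x[1]

def se_2(x):
    return math.sqrt(abs(x[0])) - x[1]

def se_3(x):
    return x[0] * x[1] - 1.0

EXPRESSIONS = [se_1, se_2, se_3]

def tbve_predict(expressions, sample, threshold):
    """Return 1 if at least `threshold` expressions vote for class 1, else 0."""
    votes = sum(1 for se in expressions if sigmoid(se(sample)) > 0.5)
    return 1 if votes >= threshold else 0

# Two of the three expressions vote positive for this sample.
sample = (1.0, 0.9)
lenient = tbve_predict(EXPRESSIONS, sample, threshold=2)  # majority is enough
strict = tbve_predict(EXPRESSIONS, sample, threshold=3)   # requires unanimity
```

Raising the threshold trades recall for precision, which is presumably why the threshold is treated as a tunable quantity.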

(This article belongs to the Special Issue Applications of Artificial Intelligence Technologies in Energy, Manufacturing and Automatic Control Processes)

28 pages, 2200 KB

Open Access | Article

Fine-Tuning Network Slicing in 5G: Unveiling Mathematical Equations for Precision Classification

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Computers 2025, 14(5), 159; https://doi.org/10.3390/computers14050159 - 25 Apr 2025

Cited by 1 | Viewed by 1273

Abstract

Modern 5G network slicing centers on the precise design of virtual, independent networks operating over a shared physical infrastructure, each configured to meet specific service requirements. This approach plays a vital role in enabling highly customized and flexible service delivery within the 5G ecosystem. In this study, we present the application of a genetic programming symbolic classifier to a dedicated network slicing dataset, resulting in the generation of accurate symbolic expressions for classifying different network slice types. To address the issue of class imbalance, we employ oversampling strategies that produce balanced variations of the dataset. Furthermore, a random search strategy is used to explore the hyperparameter space comprehensively in pursuit of optimal classification performance. The derived symbolic models, refined through threshold tuning based on prediction correctness, are subsequently evaluated on the original imbalanced dataset. The proposed method demonstrates outstanding performance, achieving a perfect classification accuracy of 1.0. Full article
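The abstract does not name the oversampling strategies used. As one simple possibility, random oversampling duplicates minority-class samples until every class matches the largest one. The sketch below assumes that technique; the function name and the toy data are illustrative only.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

# Toy imbalanced dataset: eight samples of class 0, two of class 1.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Because the balanced set contains duplicates, evaluating on the original imbalanced dataset, as the abstract describes, gives a more honest picture than scores on the balanced variation.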

(This article belongs to the Special Issue Distributed Computing Paradigms for the Internet of Things: Exploring Cloud, Edge, and Fog Solutions)

49 pages, 17199 KB

Open Access | Article

Application of Symbolic Classifiers and Multi-Ensemble Threshold Techniques for Android Malware Detection

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Big Data Cogn. Comput. 2025, 9(2), 27; https://doi.org/10.3390/bdcc9020027 - 29 Jan 2025

Viewed by 1697

Abstract

Detecting Android malware with artificial intelligence has become an essential tool for preventing cyber attacks. The methodology proposed in this paper applies a genetic programming symbolic classifier (GPSC) to obtain symbolic expressions (SEs) that can detect whether an Android application is malware. The optimal combination of GPSC hyperparameter values was sought with the random hyperparameter values search (RHVS) method, and the GPSC was trained using 5-fold cross-validation (5FCV). The initial, publicly available dataset is highly imbalanced. This problem was addressed by applying various preprocessing and oversampling techniques, creating a large number of balanced dataset variations, and the GPSC was trained on each variation. Since the dataset has many input variables, three approaches were considered: an initial investigation with all input variables, a reduced set of input variables with high feature importance, and the application of principal component analysis. After the SEs with the highest classification performance were obtained, they were combined in threshold-based voting ensembles (TBVEs), and the threshold values were adjusted to improve classification performance. Multiple TBVEs (multi-TBVE) were developed, yielding a robust system for Android malware detection with a highest accuracy of 0.98.

(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)

31 pages, 2279 KB

Open Access | Article

Achieving High Accuracy in Android Malware Detection through Genetic Programming Symbolic Classifier

by Nikola Anđelić and Sandi Baressi Šegota

Computers 2024, 13(8), 197; https://doi.org/10.3390/computers13080197 - 15 Aug 2024

Cited by 2 | Viewed by 2638

Abstract

The detection of Android malware is of paramount importance for safeguarding users’ personal and financial data from theft and misuse. It plays a critical role in ensuring the security and privacy of sensitive information on mobile devices, thereby preventing unauthorized access and potential damage. Moreover, effective malware detection is essential for maintaining device performance and reliability by mitigating the risks posed by malicious software. This paper introduces a novel approach to Android malware detection, leveraging a publicly available dataset in conjunction with a Genetic Programming Symbolic Classifier (GPSC). The primary objective is to generate symbolic expressions (SEs) that can accurately identify malware with high precision. To address the challenge of imbalanced class distribution within the dataset, various oversampling techniques are employed. Optimal hyperparameter configurations for GPSC are determined through a random hyperparameter values search (RHVS) method developed in this research. The GPSC model is trained using a 10-fold cross-validation (10FCV) technique, producing a set of 10 SEs for each dataset variation. Subsequently, the most effective SEs are integrated into a threshold-based voting ensemble (TBVE) system, which is then evaluated on the original dataset. The proposed methodology achieves a maximum accuracy of 0.956, thereby demonstrating its effectiveness for Android malware detection. Full article
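A 10-fold cross-validation yields one trained model, and hence one symbolic expression, per fold. The sketch below shows only the index bookkeeping, assuming contiguous, unshuffled folds; a practical setup would typically shuffle or stratify first. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# With k = 10 there are ten train/test splits, so ten SEs per dataset variation.
folds = list(k_fold_indices(n_samples=25, k=10))
```

Each sample appears in exactly one test fold, so every fold's expression is scored on data it never saw during training.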

32 pages, 4740 KB

Open Access | Article

An Advanced Methodology for Crystal System Detection in Li-ion Batteries

by Nikola Anđelić and Sandi Baressi Šegota

Electronics 2024, 13(12), 2278; https://doi.org/10.3390/electronics13122278 - 10 Jun 2024

Cited by 2 | Viewed by 2155

Abstract

Detecting the crystal system of lithium-ion batteries is crucial for optimizing their performance and safety. Understanding the arrangement of atoms or ions within the battery’s electrodes and electrolyte allows for improvements in energy density, cycling stability, and safety features. This knowledge also guides material design and fabrication techniques, driving advancements in battery technology for various applications. In this paper, a publicly available dataset was utilized to develop mathematical equations (MEs) using a genetic programming symbolic classifier (GPSC) to determine the type of crystal structure in Li-ion batteries with a high classification performance. The dataset consists of three different classes transformed into three binary classification datasets using a one-versus-rest approach. Since the target variable of each dataset variation is imbalanced, several oversampling techniques were employed to achieve balanced dataset variations. The GPSC was trained on these balanced dataset variations using a five-fold cross-validation (5FCV) process, and the optimal GPSC hyperparameter values were searched for using a random hyperparameter value search (RHVS) method. The goal was to find the optimal combination of GPSC hyperparameter values to achieve the highest classification performance. After obtaining MEs using the GPSC with the highest classification performance, they were combined and tested on initial binary classification dataset variations. Based on the conducted investigation, the ensemble of MEs could detect the crystal system of Li-ion batteries with a high classification accuracy (1.0). Full article
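The one-versus-rest transformation mentioned above turns a three-class target into three binary targets. A minimal sketch, with placeholder class names "A", "B", and "C" because the abstract does not list the actual classes:

```python
def one_vs_rest_labels(labels):
    """Return {class: binary label list}, with 1 for that class and 0 for all others."""
    return {
        cls: [1 if y == cls else 0 for y in labels]
        for cls in sorted(set(labels))
    }

# Placeholder class names; the real dataset has three crystal-system classes.
y = ["A", "B", "C", "A", "C", "C"]
binary_targets = one_vs_rest_labels(y)
```

Each binary target is imbalanced even when the original classes are roughly equal in size, which is consistent with the oversampling step the abstract describes.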

(This article belongs to the Section Industrial Electronics)

24 pages, 2536 KB

Open Access | Article

Enhancing Network Intrusion Detection: A Genetic Programming Symbolic Classifier Approach

by Nikola Anđelić and Sandi Baressi Šegota

Information 2024, 15(3), 154; https://doi.org/10.3390/info15030154 - 9 Mar 2024

Cited by 2 | Viewed by 2725

Abstract

This investigation underscores the paramount imperative of discerning network intrusions as a pivotal measure to fortify digital systems and shield sensitive data from unauthorized access, manipulation, and potential compromise. The principal aim of this study is to leverage a publicly available dataset, employing a Genetic Programming Symbolic Classifier (GPSC) to derive symbolic expressions (SEs) endowed with the capacity for exceedingly precise network intrusion detection. In order to augment the classification precision of the SEs, a pioneering Random Hyperparameter Value Search (RHVS) methodology was conceptualized and implemented to discern the optimal combination of GPSC hyperparameter values. The GPSC underwent training via a robust five-fold cross-validation regimen, mitigating class imbalances within the initial dataset through the application of diverse oversampling techniques, thereby engendering balanced dataset iterations. Subsequent to the acquisition of SEs, the identification of the optimal set ensued, predicated upon metrics inclusive of accuracy, area under the receiver operating characteristic curve, precision, recall, and F1-score. The selected SEs were subsequently subjected to rigorous testing on the original imbalanced dataset. The empirical findings of this research underscore the efficacy of the proposed methodology, with the derived symbolic expressions attaining an impressive classification accuracy of 0.9945. Compared with the average state-of-the-art accuracy, this represents an improvement of approximately 3.78%. In summation, this investigation contributes salient insights into the efficacious deployment of GPSC and RHVS for the meticulous detection of network intrusions, thereby accentuating the potential for the establishment of resilient cybersecurity defenses.
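A random hyperparameter value search of the kind described above can be sketched as a loop that samples one value per hyperparameter and keeps the best-scoring combination. Everything specific here is an assumption: the hyperparameter names and ranges are illustrative, and `toy_score` is a stand-in for the cross-validated GPSC score, which the abstract does not detail.

```python
import random

# Hypothetical search ranges, chosen only for illustration.
SEARCH_SPACE = {
    "population_size": (100, 2000),
    "generations": (50, 300),
    "tournament_size": (10, 100),
    "parsimony_coefficient": (1e-5, 1e-3),
}

def sample_hyperparameters(rng):
    """Draw one random value per hyperparameter from its range."""
    params = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)  # inclusive integer range
        else:
            params[name] = rng.uniform(low, high)  # continuous range
    return params

def random_search(evaluate, n_trials, seed=0):
    """Return the best (params, score) pair found in n_trials random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at 200 generations."""
    return 1.0 - abs(params["generations"] - 200) / 200.0

best_params, best_score = random_search(toy_score, n_trials=50)
```

Fixing the seed makes a search reproducible, which matters when the same procedure is repeated over many dataset variations.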

(This article belongs to the Special Issue Advanced Computer and Digital Technologies)

18 pages, 2909 KB

Open Access | Article

Thiophene Stability in Photodynamic Therapy: A Mathematical Model Approach

by Jackson J. Alcázar

Int. J. Mol. Sci. 2024, 25(5), 2528; https://doi.org/10.3390/ijms25052528 - 21 Feb 2024

Cited by 5 | Viewed by 3764

Abstract

Thiophene-containing photosensitizers are gaining recognition for their role in photodynamic therapy (PDT). However, the inherent reactivity of the thiophene moiety toward singlet oxygen threatens the stability and efficiency of these photosensitizers. This study presents a novel mathematical model capable of predicting the reactivity of thiophene toward singlet oxygen in PDT, using Conceptual Density Functional Theory (CDFT) and genetic programming. The research combines advanced computational methods, including various DFT techniques and symbolic regression, and is validated with experimental data. The findings underscore the capacity of the model to classify photosensitizers based on their photodynamic efficiency and safety, particularly noting that photosensitizers with a constant rate 1000 times lower than that of unmodified thiophene retain their photodynamic performance without substantial singlet oxygen quenching. Additionally, the research offers insights into the impact of electronic effects on thiophene reactivity. Finally, this study significantly advances thiophene-based photosensitizer design, paving the way for therapeutic agents that achieve a desirable balance between efficiency and safety in PDT. Full article

(This article belongs to the Special Issue Molecular Aspects of Photodynamic Therapy)

19 pages, 1486 KB

Open Access | Article

Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques

by Nikola Anđelić, Sandi Baressi Šegota and Zlatan Car

Computers 2023, 12(12), 242; https://doi.org/10.3390/computers12120242 - 21 Nov 2023

Cited by 7 | Viewed by 3218

Abstract

Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI empowers malware detection systems to efficiently and proactively identify and respond to sophisticated cyber threats in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to the publicly available dataset to obtain symbolic expressions (SEs) that could detect the malware software with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were utilized to obtain balanced dataset variations on which GPSC was applied. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search method (RHVS) was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used, i.e., the length and depth of SEs, accuracy score (ACC), area under receiver operating characteristic curve (AUC), precision, recall, f1-score, and confusion matrix. The best-obtained SEs are applied on the original imbalanced dataset to see if the classification performance is the same as it was on balanced dataset variations. The results of the investigation showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware software detection. Full article
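Several of the evaluation metrics listed above (ACC, precision, recall, F1-score) follow directly from the binary confusion matrix. A minimal sketch using the standard definitions; the labels and predictions are made-up toy values, not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the derived metrics for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": (tp, fp, fn, tn),
        "acc": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy example: 3 positives, 5 negatives, one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

On an imbalanced dataset accuracy alone can look high while recall on the minority class is poor, which is one reason to report the full set of metrics.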

27 pages, 6621 KB

Open Access | Article

Development of Symbolic Expressions Ensemble for Breast Cancer Type Classification Using Genetic Programming Symbolic Classifier and Decision Tree Classifier

by Nikola Anđelić and Sandi Baressi Šegota

Cancers 2023, 15(13), 3411; https://doi.org/10.3390/cancers15133411 - 29 Jun 2023

Cited by 10 | Viewed by 2763

Abstract

Breast cancer is a type of cancer with several sub-types. It occurs when cells in breast tissue grow out of control. The accurate sub-type classification of a patient diagnosed with breast cancer is mandatory for the application of proper treatment. Breast cancer classification based on gene expression is challenging even for artificial intelligence (AI) due to the large number of gene expressions. The idea in this paper is to utilize the genetic programming symbolic classifier (GPSC) on a publicly available dataset to obtain a set of symbolic expressions (SEs) that can classify the breast cancer sub-type using gene expressions with high classification accuracy. The initial problems with the dataset are a large number of input variables (54,676 gene expressions), a small number of dataset samples (151 samples), and six classes of breast cancer sub-types that are highly imbalanced. The large number of input variables is addressed with principal component analysis (PCA), while the small number of samples and the large imbalance between class samples are addressed by applying different oversampling methods, which generate different dataset variations. On each oversampled dataset, the GPSC with the random hyperparameter values search (RHVS) method is trained using 5-fold cross-validation (5CV) to obtain a set of SEs. The best set of SEs is chosen based on mean values of accuracy (ACC), the area under the receiver operating characteristic curve (AUC), precision, recall, and F1-score. In this case, the highest value is equal to 0.992 across all evaluation metrics. The best set of SEs is additionally combined with a decision tree classifier, which slightly improves ACC to 0.994.

(This article belongs to the Special Issue Inherited Breast Cancer Risk: BRCA Mutations and Beyond)

23 pages, 1718 KB

Open Access | Article

Classification of Faults Operation of a Robotic Manipulator Using Symbolic Classifier

by Nikola Anđelić, Sandi Baressi Šegota, Matko Glučina and Ivan Lorencin

Appl. Sci. 2023, 13(3), 1962; https://doi.org/10.3390/app13031962 - 2 Feb 2023

Cited by 4 | Viewed by 3344

Abstract

In autonomous manufacturing lines, it is very important to detect the faulty operation of robot manipulators to prevent potential damage. In this paper, a genetic programming algorithm (symbolic classifier) with a random selection of hyperparameter values, trained using a 5-fold cross-validation process, is proposed to determine expressions for fault detection during robotic manipulator operation, using a dataset that was made publicly available by the original researchers. The original dataset was reduced to a binary dataset (fault vs. normal operation); however, due to the class imbalance, random oversampling and SMOTE methods were applied. The quality of the best symbolic expressions (SEs) was based on the highest mean values of accuracy ($\overline{ACC}$), area under the receiver operating characteristic curve ($\overline{AUC}$), $\overline{Precision}$, $\overline{Recall}$, and $\overline{F1\text{-}Score}$. The best results were obtained on the SMOTE dataset, with $\overline{ACC}$, $\overline{AUC}$, $\overline{Precision}$, $\overline{Recall}$, and $\overline{F1\text{-}Score}$ equal to 0.99, 0.99, 0.992, 0.9893, and 0.99, respectively. Finally, the best set of mathematical equations obtained using the GPSC algorithm was evaluated on the initial dataset, where $\overline{ACC}$, $\overline{AUC}$, $\overline{Precision}$, $\overline{Recall}$, and $\overline{F1\text{-}Score}$ are equal to 0.9978, 0.998, 1.0, 0.997, and 0.998, respectively. The investigation showed that the described procedure yields symbolically expressed models with high classification performance for detecting faults in the operation of robotic manipulators.
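SMOTE, which gave the best results here, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step only, not the reference algorithm or any library's implementation; the function name and toy points are hypothetical.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Unlike random oversampling, the new points are not exact duplicates; each lies on a line segment between two existing minority samples.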

(This article belongs to the Special Issue Robot Intelligence for Grasping and Manipulation)

35 pages, 4044 KB

Open Access | Article

Classification of Wall Following Robot Movements Using Genetic Programming Symbolic Classifier

by Nikola Anđelić, Sandi Baressi Šegota, Matko Glučina and Ivan Lorencin

Machines 2023, 11(1), 105; https://doi.org/10.3390/machines11010105 - 12 Jan 2023

Cited by 4 | Viewed by 3273

Abstract

The navigation of mobile robots throughout the surrounding environment without collisions is one of the mandatory behaviors in the field of mobile robotics. The movement of the robot through its surrounding environment is achieved using sensors and a control system. The application of artificial intelligence could potentially predict the possible movement of a mobile robot if the robot encounters potential obstacles. The data used in this paper were obtained from a wall-following robot that navigates through the room following the wall in a clockwise direction with the use of 24 ultrasound sensors. The idea of this paper is to apply a genetic programming symbolic classifier (GPSC) with random hyperparameter search and 5-fold cross-validation to investigate whether these methods can classify the movement into the correct category (move forward, slight right turn, sharp right turn, and slight left turn) with high accuracy. Since the original dataset is imbalanced, oversampling methods (ADASYN, SMOTE, and BorderlineSMOTE) were applied to achieve balance between class samples. These oversampled dataset variations were used to train the GPSC algorithm with a random hyperparameter search and 5-fold cross-validation. The mean and standard deviation of accuracy ($ACC$), the area under the receiver operating characteristic curve ($AUC$), precision, recall, and $F1\text{-}score$ values were used to measure the classification performance of the obtained symbolic expressions. The investigation showed that the best symbolic expressions were obtained on a dataset balanced with the BorderlineSMOTE method, with $\overline{ACC} \pm SD(ACC)$, $\overline{AUC}_{macro} \pm SD(AUC)$, $\overline{Precision}_{macro} \pm SD(Precision)$, $\overline{Recall}_{macro} \pm SD(Recall)$, and $\overline{F1\text{-}score}_{macro} \pm SD(F1\text{-}score)$ equal to $0.975 \pm 1.81 \times 10^{-3}$, $0.997 \pm 6.37 \times 10^{-4}$, $0.975 \pm 1.82 \times 10^{-3}$, $0.976 \pm 1.59 \times 10^{-3}$, and $0.9785 \pm 1.74 \times 10^{-3}$, respectively. The final test was to apply the set of best symbolic expressions to the original dataset. In this case, $\overline{ACC} \pm SD(ACC)$, $\overline{AUC} \pm SD(AUC)$, $\overline{Precision} \pm SD(Precision)$, $\overline{Recall} \pm SD(Recall)$, and $\overline{F1\text{-}score} \pm SD(F1\text{-}score)$ are equal to $0.956 \pm 0.05$, $0.9536 \pm 0.057$, $0.9507 \pm 0.0275$, $0.9809 \pm 0.01$, and $0.9698 \pm 0.00725$, respectively. The results of the investigation showed that this simple, non-linearly separable classification task could be solved using the GPSC algorithm with high accuracy.
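Values reported as mean ± SD summarize the per-fold scores of the 5-fold cross-validation. A minimal sketch of that aggregation, assuming the sample standard deviation (the abstract does not say whether the sample or population form was used) and using made-up fold scores rather than the paper's data:

```python
import statistics

def mean_and_sd(fold_scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

def format_mean_sd(fold_scores):
    """Format scores as 'mean ± SD' with the SD in scientific notation."""
    mean, sd = mean_and_sd(fold_scores)
    return f"{mean:.3f} ± {sd:.2e}"

# Hypothetical accuracy values from five folds.
acc_folds = [0.970, 0.980, 0.975, 0.978, 0.972]
acc_mean, acc_sd = mean_and_sd(acc_folds)
acc_report = format_mean_sd(acc_folds)
```

A small SD relative to the mean indicates that the expressions obtained on different folds perform consistently.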

(This article belongs to the Special Issue Modeling, Sensor Fusion and Control Techniques in Applied Robotics)

33 pages, 16386 KB

Open Access | Article

The Development of Symbolic Expressions for the Detection of Hepatitis C Patients and the Disease Progression from Blood Parameters Using Genetic Programming-Symbolic Classification Algorithm

by Nikola Anđelić, Ivan Lorencin, Sandi Baressi Šegota and Zlatan Car

Appl. Sci. 2023, 13(1), 574; https://doi.org/10.3390/app13010574 - 31 Dec 2022

Cited by 7 | Viewed by 3165

Abstract

Hepatitis C is an infectious disease caused by the Hepatitis C virus (HCV), which primarily affects the liver. Based on the publicly available dataset used in this paper, the idea is to develop a mathematical equation that could be used to detect HCV patients with high accuracy based on the enzymes, proteins, and biomarker values contained in a patient’s blood sample, using the genetic programming symbolic classification (GPSC) algorithm. The idea was also to obtain a mathematical equation that could detect the progress of the disease, i.e., Hepatitis C, Fibrosis, and Cirrhosis, using the GPSC algorithm. Since the original dataset was imbalanced (a large number of healthy patients versus a small number of Hepatitis C/Fibrosis/Cirrhosis patients), the dataset was balanced using random oversampling, SMOTE, ADASYN, and Borderline SMOTE methods. The symbolic expressions (mathematical equations) were obtained using the GPSC algorithm through a rigorous process of 5-fold cross-validation with a random hyperparameter search method, which had to be developed for this problem. To evaluate each symbolic expression generated with GPSC, the mean and standard deviation values of accuracy ($ACC$), the area under the receiver operating characteristic curve ($AUC$), precision, recall, and F1-score were obtained. In the simple binary case (healthy vs. Hepatitis C patients), the best case was achieved with a dataset balanced with the Borderline SMOTE method. The results are $\overline{ACC} \pm SD(ACC)$, $\overline{AUC} \pm SD(AUC)$, $\overline{Precision} \pm SD(Precision)$, $\overline{Recall} \pm SD(Recall)$, and $\overline{F1\text{-}score} \pm SD(F1\text{-}score)$ equal to $0.99 \pm 5.8 \times 10^{-3}$, $0.99 \pm 5.4 \times 10^{-3}$, $0.998 \pm 1.3 \times 10^{-3}$, $0.98 \pm 1.19 \times 10^{-3}$, and $0.99 \pm 5.39 \times 10^{-3}$, respectively. For the multiclass problem, OneVsRestClassifier was used in combination with GPSC, 5-fold cross-validation, and random hyperparameter search, and the best case was again achieved with a dataset balanced with the Borderline SMOTE method. To evaluate the symbolic expressions obtained in this case, the same evaluation metrics were used; however, for $AUC$, $Precision$, $Recall$, and $F1\text{-}score$ the macro values were computed, since this method calculates the metric for each label and finds their unweighted mean. In the multiclass case, $\overline{ACC} \pm SD(ACC)$, $\overline{AUC}_{macro} \pm SD(AUC)$, $\overline{Precision}_{macro} \pm SD(Precision)$, $\overline{Recall}_{macro} \pm SD(Recall)$, and $\overline{F1\text{-}score}_{macro} \pm SD(F1\text{-}score)$ are equal to $0.934 \pm 9 \times 10^{-3}$, $0.987 \pm 1.8 \times 10^{-3}$, $0.942 \pm 6.9 \times 10^{-3}$, $0.934 \pm 7.84 \times 10^{-3}$, and $0.932 \pm 8.4 \times 10^{-3}$, respectively. For the best binary and multi-class cases, the symbolic expressions are shown and evaluated on the original dataset.
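Macro averaging, as described above, computes a metric separately for each label and then takes the unweighted mean. A minimal sketch for precision, recall, and F1-score, where macro F1 is taken as the mean of the per-class F1 values; the labels below are toy values, not data from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    # Unweighted mean: every class counts equally, regardless of its size.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Because each class contributes equally, a rare class with poor recall lowers the macro score as much as a common one would.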

(This article belongs to the Special Issue Deep Learning and Machine Learning in Biomedical Data)

27 pages, 9438 KB

Open Access | Article

The Development of Symbolic Expressions for Fire Detection with Symbolic Classifier Using Sensor Fusion Data

by Nikola Anđelić, Sandi Baressi Šegota, Ivan Lorencin and Zlatan Car

Sensors 2023, 23(1), 169; https://doi.org/10.3390/s23010169 - 24 Dec 2022

Cited by 12 | Viewed by 6461

Abstract

Fire is usually detected with fire detection systems that sense one or more products of the fire, such as smoke, heat, infrared, ultraviolet light radiation, or gas. Smoke detectors are mostly used in residential areas, while fire alarm systems (heat, smoke, flame, and fire gas detectors) are used in commercial, industrial, and municipal areas. However, in addition to smoke, heat, infrared, ultraviolet light radiation, or gas, other parameters could indicate a fire, such as air temperature, air pressure, and humidity, among others. Collecting these parameters requires the development of a sensor fusion system. With such a system, it is also necessary to develop a simple system based on artificial intelligence (AI) that can detect fire with high accuracy using the information collected from the sensor fusion system. The novelty of this paper is the procedure by which a simple AI system can be created in the form of a symbolic expression obtained with a genetic programming symbolic classifier (GPSC) algorithm and used as an additional tool to detect fire with high classification accuracy. Since the investigation is based on an initially imbalanced, publicly available dataset (a high number of samples classified as 1-Fire Alarm and a small number of samples classified as 0-No Fire Alarm), the idea is to implement various balancing methods such as random undersampling/oversampling, Near Miss-1, ADASYN, SMOTE, and Borderline SMOTE. The obtained balanced datasets were used in GPSC with a random hyperparameter search combined with 5-fold cross-validation to obtain symbolic expressions that could detect fire with high classification accuracy. For this investigation, the random hyperparameter search method and 5-fold cross-validation had to be developed. Each obtained symbolic expression was evaluated on the train and test datasets to obtain the mean and standard deviation values of accuracy ($ACC$), area under the receiver operating characteristic curve ($AUC$), precision, recall, and $F1\text{-}score$. Based on the conducted investigation, the highest classification metric values were achieved in the case of the dataset balanced with the SMOTE method. The obtained values of $\overline{ACC} \pm SD(ACC)$, $\overline{AUC} \pm SD(AUC)$, $\overline{Precision} \pm SD(Precision)$, $\overline{Recall} \pm SD(Recall)$, and $\overline{F1\text{-}score} \pm SD(F1\text{-}score)$ are equal to $0.998 \pm 4.79 \times 10^{-5}$, $0.998 \pm 4.79 \times 10^{-5}$, $0.999 \pm 5.32 \times 10^{-5}$, $0.998 \pm 4.26 \times 10^{-5}$, and $0.998 \pm 4.796 \times 10^{-5}$, respectively. The symbolic expression with which the best values of the classification metrics were achieved is shown, and the final evaluation was performed on the original dataset.
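The evaluation step described in this abstract (per-split classification metrics summarized as mean ± standard deviation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `binary_metrics` and `mean_and_sd` are hypothetical, the population standard deviation is assumed (the abstract does not say which estimator was used), and $AUC$ is omitted because it requires continuous scores rather than hard labels.

```python
from statistics import mean, pstdev


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "ACC": (tp + tn) / len(y_true),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


def mean_and_sd(metrics_per_split):
    """Summarize each metric over splits (e.g. train and test) as (mean, SD).

    Assumption: population SD (pstdev); swap in statistics.stdev for the
    sample estimator if that is what is wanted.
    """
    summary = {}
    for name in metrics_per_split[0]:
        values = [m[name] for m in metrics_per_split]
        summary[name] = (mean(values), pstdev(values))
    return summary


# Toy labels standing in for the predictions of one symbolic expression.
train = binary_metrics([1, 1, 0, 0], [1, 1, 0, 0])
test = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
summary = mean_and_sd([train, test])
```

A very small standard deviation, such as those reported above, indicates that the scores on the individual splits are nearly identical.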

(This article belongs to the Special Issue Recent Advancements in Olfaction and Electronic Nose)

30 pages, 2787 KB

Open Access Article

Detection of Malicious Websites Using Symbolic Classifier

by Nikola Anđelić, Sandi Baressi Šegota, Ivan Lorencin and Matko Glučina

Future Internet 2022, 14(12), 358; https://doi.org/10.3390/fi14120358 - 29 Nov 2022

Cited by 6 | Viewed by 4135

Abstract

Malicious websites are web locations that attempt to install malware, which is the general term for anything that will cause problems in computer operation, gather confidential information, or gain total control over the computer. In this paper, a novel approach is proposed which consists of the implementation of the genetic programming symbolic classifier (GPSC) algorithm on a publicly available dataset to obtain a simple symbolic expression (mathematical equation) which could detect malicious websites with high classification accuracy. Due to a large imbalance of classes in the initial dataset, several data sampling methods (random undersampling/oversampling, ADASYN, SMOTE, BorderlineSMOTE, and KMeansSMOTE) were used to balance the dataset classes. For this investigation, the hyperparameter search method was developed to find the combination of GPSC hyperparameters with which high classification accuracy could be achieved. The first investigation was conducted using GPSC with a random hyperparameter search method, and each dataset variation was divided into train and test datasets in a ratio of 70:30. To evaluate each symbolic expression, its performance was measured on the train and test datasets, and the mean and standard deviation values of accuracy ($ACC$), $AUC$, precision, recall, and F1-score were obtained. The second investigation was also conducted using GPSC with the random hyperparameter search method; however, 70% of the dataset, i.e., the train dataset, was used to perform 5-fold cross-validation. If the mean accuracy, $AUC$, precision, recall, and F1-score values were above 0.97, then final training and testing (train/test 70:30) were performed with GPSC using the same randomly chosen hyperparameters as in the 5-fold cross-validation process, and the final mean and standard deviation values of the aforementioned evaluation metrics were obtained. In both investigations, the best symbolic expression was obtained in the case where the dataset balanced with the KMeansSMOTE method was used for training and testing. The best symbolic expression obtained using GPSC with the random hyperparameter search method and the classic train–test procedure (70:30) on a dataset balanced with the KMeansSMOTE method achieved values of $\overline{ACC}$, $\overline{AUC}$, $\overline{Precision}$, $\overline{Recall}$, and $\overline{F1\text{-}score}$ (with standard deviation) of $0.9992 \pm 2.249 \times 10^{-5}$, $0.9995 \pm 9.945 \times 10^{-6}$, $0.9995 \pm 1.09 \times 10^{-5}$, $0.999 \pm 5.17 \times 10^{-5}$, and $0.9992 \pm 5.17 \times 10^{-6}$, respectively. The best symbolic expression obtained using GPSC with the random hyperparameter search method and 5-fold cross-validation on a dataset balanced with the KMeansSMOTE method achieved values of $\overline{ACC}$, $\overline{AUC}$, $\overline{Precision}$, $\overline{Recall}$, and $\overline{F1\text{-}score}$ (with standard deviation) of $0.9994 \pm 1.13 \times 10^{-5}$, $0.9994 \pm 1.2 \times 10^{-5}$, $1.0 \pm 0$, $0.9988 \pm 2.4 \times 10^{-5}$, and $0.9994 \pm 1.2 \times 10^{-5}$, respectively.
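The second investigation in this abstract (random hyperparameter search in which a candidate is accepted only if every mean 5-fold cross-validation metric exceeds 0.97) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' implementation: `k_fold_indices`, `random_search_with_cv`, `fit_threshold`, and `score_threshold` are hypothetical names, and a trivial one-parameter threshold rule stands in for the GPSC so that the example stays self-contained.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs over shuffled indices."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def random_search_with_cv(X, y, sample_params, fit, score,
                          k=5, threshold=0.97, max_trials=50, seed=0):
    """Draw random hyperparameters until all mean CV metrics exceed threshold.

    Returns (params, mean_metrics), or (None, None) if no trial passes.
    """
    rng = random.Random(seed)
    for _ in range(max_trials):
        params = sample_params(rng)
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(X), k, seed):
            model = fit([X[i] for i in train_idx],
                        [y[i] for i in train_idx], params)
            fold_scores.append(score(model,
                                     [X[i] for i in val_idx],
                                     [y[i] for i in val_idx]))
        means = {name: mean(s[name] for s in fold_scores)
                 for name in fold_scores[0]}
        if all(value > threshold for value in means.values()):
            return params, means
    return None, None


# Hypothetical stand-in for the classifier: predict 1 when x >= params["cut"].
def fit_threshold(X_train, y_train, params):
    return params["cut"]  # nothing is learned in this toy; the cut is the model


def score_threshold(model, X_val, y_val):
    preds = [1 if x >= model else 0 for x in X_val]
    correct = sum(1 for p, t in zip(preds, y_val) if p == t)
    return {"ACC": correct / len(y_val)}


X = list(range(20))
y = [1 if x >= 10 else 0 for x in X]
best_params, best_means = random_search_with_cv(
    X, y, lambda rng: {"cut": rng.uniform(0, 20)},
    fit_threshold, score_threshold)
```

In the procedure the abstract describes, the accepted hyperparameters would then be reused for a final 70:30 train/test run; that step is left out here.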

(This article belongs to the Special Issue Trends of Data Science and Knowledge Discovery)

22 pages, 2342 KB

Open Access Article

Investigating the Physics of Tokamak Global Stability with Interpretable Machine Learning Tools

by Andrea Murari, Emmanuele Peluso, Michele Lungaroni, Riccardo Rossi, Michela Gelfusa and JET Contributors

Appl. Sci. 2020, 10(19), 6683; https://doi.org/10.3390/app10196683 - 24 Sep 2020

Cited by 20 | Viewed by 4783

Abstract

The inadequacies of basic physics models for disruption prediction have induced the community to rely increasingly on data mining tools. In the last decade, it has been shown that machine learning predictors can achieve much better performance than that obtained with manually identified thresholds or empirical descriptions of the plasma stability limits. The main criticisms of these techniques therefore focus on two different but interrelated issues: poor “physics fidelity” and limited interpretability. Insufficient “physics fidelity” refers to the fact that the mathematical models of most data mining tools do not reflect the physics of the underlying phenomena. Moreover, they implement a black box approach to learning, which results in very poor interpretability of their outputs. To overcome or at least mitigate these limitations, a general methodology has been devised and tested, with the objective of combining the predictive capability of machine learning tools with the expression of the operational boundary in terms of traditional equations more suited to understanding the underlying physics. The proposed approach relies on the application of machine learning classifiers (such as Support Vector Machines or Classification Trees) and Symbolic Regression via Genetic Programming directly to experimental databases. The results are very encouraging. The obtained equations of the boundary between the safe and disruptive regions of the operational space show almost the same performance as the machine learning classifiers, which are based on completely independent learning techniques. Moreover, these models possess significantly better predictive power than traditional representations, such as the Hugill or the beta limit. More importantly, they are realistic and intuitive mathematical formulas, which are well suited to supporting theoretical understanding and to benchmarking empirical models. They can also be deployed easily and efficiently in real-time feedback systems.

(This article belongs to the Special Issue Recent Developments in Fusion Plasma Diagnostics)

Search Results (17)
