MDPI - Publisher of Open Access Journals

42 pages, 13901 KiB

Open Access | Feature Paper | Article

Hybrid Explainable AI for Machine Predictive Maintenance: From Symbolic Expressions to Meta-Ensembles

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Processes 2025, 13(7), 2180; https://doi.org/10.3390/pr13072180 - 8 Jul 2025

Viewed by 419

Machine predictive maintenance plays a critical role in reducing unplanned downtime, lowering maintenance costs, and improving operational reliability by enabling the early detection and classification of potential failures. Artificial intelligence (AI) enhances these capabilities through advanced algorithms that can analyze complex sensor data with high accuracy and adaptability. This study introduces an explainable AI framework for failure detection and classification using symbolic expressions (SEs) derived from a genetic programming symbolic classifier (GPSC). Due to the imbalanced nature and wide variable ranges in the original dataset, we applied scaling/normalization and oversampling techniques to generate multiple balanced dataset variations. Each variation was used to train the GPSC with five-fold cross-validation, and optimal hyperparameters were selected using a Random Hyperparameter Value Search (RHVS) method. However, as the initial Threshold-Based Voting Ensembles (TBVEs) built from SEs did not achieve a satisfactory performance for all classes, a meta-dataset was developed from the outputs of the obtained SEs. For each class, a meta-dataset was preprocessed, balanced, and used to train a Random Forest Classifier (RFC) with hyperparameter tuning via RandomizedSearchCV. For each class, a TBVE was then constructed from the saved RFC models. The resulting ensemble demonstrated a near-perfect performance for failure detection and classification in most classes (0, 1, 3, and 5), although Classes 2 and 4 achieved a lower performance, which could be attributed to an extremely low number of samples and a hard-to-detect type of failure. Overall, the proposed method presents a robust and explainable AI solution for predictive maintenance, combining symbolic learning with ensemble-based meta-modeling. Full article

(This article belongs to the Special Issue Applications of Artificial Intelligence Technologies in Energy, Manufacturing and Automatic Control Processes)

18 pages, 1463 KiB

Open Access | Article

On Predicting Marine Engine Measurements with Synthetic Data in Scarce Dataset

by Sandi Baressi Šegota, Igor Poljak, Nikola Anđelić and Vedran Mrzljak

J. Mar. Sci. Eng. 2025, 13(7), 1289; https://doi.org/10.3390/jmse13071289 - 30 Jun 2025

Viewed by 252

Abstract

The scarcity of high-quality maritime datasets poses a significant challenge for machine learning (ML) applications in marine engineering, particularly in scenarios where real-world data collection is limited or impractical. This study investigates the effectiveness of synthetic data generation and cross-modeling in predicting operational metrics of LNG carrier engines. A total of 38 real-world data points were collected from port and starboard engines, focusing on four target outputs: mechanical efficiency, fuel consumption, load, and effective power. CopulaGAN, a hybrid generative model combining statistical copulas and generative adversarial networks, was employed to produce synthetic datasets. These were used to train multilayer perceptron (MLP) regression models, which were optimized via grid search and validated through five-fold cross-validation. The results show that synthetic data can yield accurate models, with mean absolute percentage errors (MAPE) below 2% in most cases. The combined synthetic datasets consistently outperformed those generated from single-engine data. Cross-modeling was partially successful, as models trained on starboard data generalized well to port data but not vice versa. The engine load variable remained challenging to predict due to its narrow and low-range distribution. Overall, the study highlights synthetic data as a viable solution for enhancing the performance of ML models in data-scarce maritime applications. Full article

(This article belongs to the Section Ocean Engineering)

13 pages, 12530 KiB

Open Access | Article

Data Augmentation-Driven Improvements in Malignant Lymphoma Image Classification

by Sandi Baressi Šegota, Vedran Mrzljak, Ivan Lorencin and Nikola Anđelić

Computers 2025, 14(7), 252; https://doi.org/10.3390/computers14070252 - 26 Jun 2025

Viewed by 319

Abstract

Artificial intelligence (AI)-based techniques have become increasingly prevalent in the classification of medical images. However, the effectiveness of such methods is often constrained by the limited availability of annotated medical data. To address this challenge, data augmentation is frequently employed. This study investigates the impact of a novel augmentation approach on the classification performance of malignant lymphoma histopathological images. The proposed method involves slicing high-resolution images (1388 × 1040 pixels) into smaller segments (224 × 224 pixels) before applying standard augmentation techniques such as flipping and rotation. The original dataset consists of 374 images, comprising 32.6% mantle cell lymphoma, 30.2% chronic lymphocytic leukemia, and 37.2% follicular lymphoma. Through slicing, the dataset was expanded to 8976 images, and further augmented to 53,856 images. The visual geometry group with 16 layers (VGG16) convolutional neural network (CNN) was trained and evaluated on three datasets: the original, the sliced, and the sliced with augmentation. Performance was assessed using accuracy, AUC, precision, sensitivity, specificity, and F1 score. The results demonstrate a substantial improvement in classification performance when slicing was employed, with additional, albeit smaller, gains achieved through subsequent augmentation. Full article
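
The image counts quoted in this abstract can be checked with a short sketch. It assumes non-overlapping 224 × 224 tiles, with any partial tile at the right or bottom edge discarded; the abstract does not state this, but the assumption reproduces the reported totals. The function name `tile_boxes` is hypothetical.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes for non-overlapping square tiles.

    Assumption: tiles that would extend past the image edge are discarded.
    """
    cols = width // tile
    rows = height // tile
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


boxes = tile_boxes(1388, 1040, 224)
tiles_per_image = len(boxes)              # 6 columns x 4 rows
sliced_total = 374 * tiles_per_image      # dataset size after slicing
augment_factor = 53856 // sliced_total    # copies per slice implied by the totals
print(tiles_per_image, sliced_total, augment_factor)  # prints "24 8976 6"
```

Under this assumption each source image yields 24 slices, and the augmented total corresponds to six versions of every slice (presumably the slice itself plus five flipped or rotated copies, although the abstract does not list the exact transforms).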

(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

28 pages, 2200 KiB

Open Access | Article

Fine-Tuning Network Slicing in 5G: Unveiling Mathematical Equations for Precision Classification

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Computers 2025, 14(5), 159; https://doi.org/10.3390/computers14050159 - 25 Apr 2025

Viewed by 566

Abstract

Modern 5G network slicing centers on the precise design of virtual, independent networks operating over a shared physical infrastructure, each configured to meet specific service requirements. This approach plays a vital role in enabling highly customized and flexible service delivery within the 5G ecosystem. In this study, we present the application of a genetic programming symbolic classifier to a dedicated network slicing dataset, resulting in the generation of accurate symbolic expressions for classifying different network slice types. To address the issue of class imbalance, we employ oversampling strategies that produce balanced variations of the dataset. Furthermore, a random search strategy is used to explore the hyperparameter space comprehensively in pursuit of optimal classification performance. The derived symbolic models, refined through threshold tuning based on prediction correctness, are subsequently evaluated on the original imbalanced dataset. The proposed method demonstrates outstanding performance, achieving a perfect classification accuracy of 1.0. Full article

(This article belongs to the Special Issue Distributed Computing Paradigms for the Internet of Things: Exploring Cloud, Edge, and Fog Solutions)

22 pages, 2775 KiB

Open Access | Article

Machine Learning Models for the Prediction of Wind Loads on Containerships

by Nastia Degiuli, Carlo Giorgio Grlj, Ivana Martić, Sandi Baressi Šegota, Nikola Anđelić and Darin Majnarić

J. Mar. Sci. Eng. 2025, 13(3), 417; https://doi.org/10.3390/jmse13030417 - 24 Feb 2025

Cited by 1 | Viewed by 770

Abstract

As the windage area of containerships increases, wind loads are becoming a more significant factor in navigating ships at open sea. This can lead to increased resistance and affect ship stability, maneuverability, and fuel efficiency. In this study, machine learning models based on the multilayer perceptron and gradient-boosted tree methods were employed to predict wind load coefficients for containerships with various container configurations. Six models were developed to estimate longitudinal and transverse wind loads and moment coefficients using a comprehensive dataset generated by numerical simulations. Numerical simulations were conducted for two containerships with various container configurations at angles of attack ranging from 0° to 180°. The models showed satisfactory performance on an evaluation set, with high coefficients of determination. The models based on the gradient-boosted tree method slightly outperformed those based on the multilayer perceptron method, particularly in terms of mean absolute error. The study demonstrates that accurate prediction of wind load coefficients is feasible, making these models a reliable tool for practical engineering applications. Full article

(This article belongs to the Special Issue CFD Applications in Ship and Offshore Hydrodynamics 2nd Edition)

49 pages, 17199 KiB

Open Access | Article

Application of Symbolic Classifiers and Multi-Ensemble Threshold Techniques for Android Malware Detection

by Nikola Anđelić, Sandi Baressi Šegota and Vedran Mrzljak

Big Data Cogn. Comput. 2025, 9(2), 27; https://doi.org/10.3390/bdcc9020027 - 29 Jan 2025

Viewed by 903

Abstract

Android malware detection using artificial intelligence is now an essential tool for preventing cyber attacks. To address this problem, the proposed methodology applies a genetic programming symbolic classifier (GPSC) to obtain symbolic expressions (SEs) that can detect whether an Android application is malware. The optimal combination of GPSC hyperparameter values was found using the random hyperparameter values search (RHVS) method, and the GPSC was trained using 5-fold cross-validation (5FCV). The initial, publicly available dataset is highly imbalanced. This problem was addressed by applying various preprocessing and oversampling techniques, creating a large number of balanced dataset variations, and the GPSC was trained on each variation. Since the dataset has many input variables, three approaches were considered: an initial investigation with all input variables, input variables with high feature importance, and the application of principal component analysis. After the SEs with the highest classification performance were obtained, they were used in threshold-based voting ensembles (TBVEs), and the threshold values were adjusted to improve classification performance. Multiple TBVEs (multi-TBVE) were developed, yielding a robust system for Android malware detection with a highest accuracy of 0.98. Full article

(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)

18 pages, 1854 KiB

Open Access | Article

Modeling of Actuation Force, Pressure and Contraction of Fluidic Muscles Based on Machine Learning

by Sandi Baressi Šegota, Mario Ključević, Dario Ogrizović and Zlatan Car

Technologies 2024, 12(9), 161; https://doi.org/10.3390/technologies12090161 - 12 Sep 2024

Viewed by 2370

Abstract

In this paper, the dataset is collected from the fluidic muscle datasheet. This dataset is then used to train models predicting the pressure, force, and contraction length of the fluidic muscle, as three separate outputs. This modeling is performed with four algorithms: extreme gradient boosted trees (XGB), ElasticNet (ENet), support vector regressor (SVR), and multilayer perceptron (MLP) artificial neural network. Each of the four fluidic muscle models (5-100N, 10-100N, 20-200N, 40-400N) is first modeled separately, for later comparison. Then, a combined dataset consisting of the data for all four muscle models is used for training. The results show that it is possible to achieve quality regression performance with the listed algorithms, especially with the general model, which performs better than the individual models. Still, room for improvement exists, due to the high variance of the results across validation sets, possibly caused by non-normal data distributions. Full article

(This article belongs to the Section Manufacturing Technology)

31 pages, 2279 KiB

Open Access | Article

Achieving High Accuracy in Android Malware Detection through Genetic Programming Symbolic Classifier

by Nikola Anđelić and Sandi Baressi Šegota

Computers 2024, 13(8), 197; https://doi.org/10.3390/computers13080197 - 15 Aug 2024

Viewed by 1519

Abstract

The detection of Android malware is of paramount importance for safeguarding users’ personal and financial data from theft and misuse. It plays a critical role in ensuring the security and privacy of sensitive information on mobile devices, thereby preventing unauthorized access and potential damage. Moreover, effective malware detection is essential for maintaining device performance and reliability by mitigating the risks posed by malicious software. This paper introduces a novel approach to Android malware detection, leveraging a publicly available dataset in conjunction with a Genetic Programming Symbolic Classifier (GPSC). The primary objective is to generate symbolic expressions (SEs) that can accurately identify malware with high precision. To address the challenge of imbalanced class distribution within the dataset, various oversampling techniques are employed. Optimal hyperparameter configurations for GPSC are determined through a random hyperparameter values search (RHVS) method developed in this research. The GPSC model is trained using a 10-fold cross-validation (10FCV) technique, producing a set of 10 SEs for each dataset variation. Subsequently, the most effective SEs are integrated into a threshold-based voting ensemble (TBVE) system, which is then evaluated on the original dataset. The proposed methodology achieves a maximum accuracy of 0.956, thereby demonstrating its effectiveness for Android malware detection. Full article
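
The threshold-based voting ensemble (TBVE) mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not spell out the voting rule, so the rule below is an assumption: each symbolic expression's raw output is passed through a sigmoid, an output of at least 0.5 counts as one positive vote, and the ensemble predicts the positive class when the vote count reaches a tunable threshold. The three expressions are hypothetical stand-ins for SEs produced by a trained GPSC.

```python
import math


def sigmoid(x):
    """Map a raw symbolic-expression output to the interval (0, 1)."""
    # Clamp the input to avoid overflow in math.exp for extreme outputs.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def tbve_predict(expressions, sample, vote_threshold):
    """Return 1 if at least `vote_threshold` expressions vote positive, else 0."""
    votes = sum(1 for se in expressions if sigmoid(se(sample)) >= 0.5)
    return 1 if votes >= vote_threshold else 0


# Hypothetical symbolic expressions (assumption, not taken from the paper).
expressions = [
    lambda x: x[0] - 0.5,
    lambda x: x[0] * x[1] - 0.2,
    lambda x: x[1] - x[0],
]

sample = (0.9, 0.4)
print(tbve_predict(expressions, sample, vote_threshold=2))  # two of three vote positive
print(tbve_predict(expressions, sample, vote_threshold=3))  # unanimity is not reached
```

Raising `vote_threshold` trades recall for precision, which is one plausible reason the threshold is tuned rather than fixed at a simple majority.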

32 pages, 4740 KiB

Open Access | Article

An Advanced Methodology for Crystal System Detection in Li-ion Batteries

by Nikola Anđelić and Sandi Baressi Šegota

Electronics 2024, 13(12), 2278; https://doi.org/10.3390/electronics13122278 - 10 Jun 2024

Cited by 1 | Viewed by 1413

Abstract

Detecting the crystal system of lithium-ion batteries is crucial for optimizing their performance and safety. Understanding the arrangement of atoms or ions within the battery’s electrodes and electrolyte allows for improvements in energy density, cycling stability, and safety features. This knowledge also guides material design and fabrication techniques, driving advancements in battery technology for various applications. In this paper, a publicly available dataset was utilized to develop mathematical equations (MEs) using a genetic programming symbolic classifier (GPSC) to determine the type of crystal structure in Li-ion batteries with a high classification performance. The dataset consists of three different classes transformed into three binary classification datasets using a one-versus-rest approach. Since the target variable of each dataset variation is imbalanced, several oversampling techniques were employed to achieve balanced dataset variations. The GPSC was trained on these balanced dataset variations using a five-fold cross-validation (5FCV) process, and the optimal GPSC hyperparameter values were searched for using a random hyperparameter value search (RHVS) method. The goal was to find the optimal combination of GPSC hyperparameter values to achieve the highest classification performance. After obtaining MEs using the GPSC with the highest classification performance, they were combined and tested on initial binary classification dataset variations. Based on the conducted investigation, the ensemble of MEs could detect the crystal system of Li-ion batteries with a high classification accuracy (1.0). Full article
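
The one-versus-rest transformation described in this abstract can be sketched as follows. The class names "A", "B", and "C" are placeholders (the abstract does not name the three crystal systems), and the function name `one_versus_rest` is hypothetical.

```python
def one_versus_rest(labels):
    """Build one binary target per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {cls: [1 if y == cls else 0 for y in labels] for cls in classes}


# Placeholder labels for illustration only.
labels = ["A", "B", "C", "A", "A"]
binary_targets = one_versus_rest(labels)
for cls, target in binary_targets.items():
    print(cls, target, "positives:", sum(target))
```

Each binary target is typically imbalanced even when the original classes are roughly even, since one class is set against the union of the others; this is consistent with the abstract's use of oversampling after the split.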

(This article belongs to the Section Industrial Electronics)

28 pages, 2374 KiB

Open Access | Article

Convolutional Neural Networks for Local Component Number Estimation from Time–Frequency Distributions of Multicomponent Nonstationary Signals

by Vedran Jurdana and Sandi Baressi Šegota

Mathematics 2024, 12(11), 1661; https://doi.org/10.3390/math12111661 - 26 May 2024

Cited by 4 | Viewed by 1054

Abstract

Frequency-modulated (FM) signals, prevalent across various applied disciplines, exhibit time-dependent frequencies and a multicomponent nature necessitating the utilization of time-frequency methods. Accurately determining the number of components in such signals is crucial for various applications reliant on this metric. However, this poses a challenge, particularly amidst interfering components of varying amplitudes in noisy environments. While the localized Rényi entropy (LRE) method is effective for component counting, its accuracy significantly diminishes when analyzing signals with intersecting components, components that deviate from the time axis, and components with different amplitudes. This paper addresses these limitations and proposes a convolutional neural network-based (CNN) approach for determining the local number of components using a time–frequency distribution of a signal as input. A comprehensive training set comprising single and multicomponent linear and quadratic FM components with diverse time and frequency supports has been constructed, emphasizing special cases of noisy signals with intersecting components and differing amplitudes. The results demonstrate that the estimated component numbers outperform those obtained using the LRE method for considered noisy multicomponent synthetic signals. Furthermore, we validate the efficacy of the proposed CNN approach on real-world gravitational and electroencephalogram signals, underscoring its robustness and applicability across different signal types and conditions. Full article

(This article belongs to the Section E1: Mathematics and Computer Science)

17 pages, 7131 KiB

Open Access | Article

Regression Model for the Prediction of Total Motor Power Used by an Industrial Robot Manipulator during Operation

by Sandi Baressi Šegota, Nikola Anđelić, Jelena Štifanić and Zlatan Car

Machines 2024, 12(4), 225; https://doi.org/10.3390/machines12040225 - 28 Mar 2024

Cited by 1 | Viewed by 1629

Abstract

Motor power models are a key tool in robotics for modeling and simulations related to control and optimization. The authors collect a dataset of motor power using the ABB IRB 120 industrial robot. This paper applies a multilayer perceptron (MLP) model to the collected dataset. Before the training of MLP models, each of the variables in the dataset is evaluated using the random forest (RF) model, observing two metrics: mean decrease in impurity (MDI) and feature permutation score difference (FP). Pearson’s correlation coefficient was also applied. Based on these scores, a total of 15 variables, mainly static variables connected with the position and orientation of the robot, are eliminated from the dataset. The scores demonstrate that while both MLPs achieve good scores, the model trained on the pruned dataset performs better. The model trained on the pruned dataset achieves $\bar{R}^{2} = 0.99924$ ($\sigma = 0.00007$) and $\overline{MAPE} = 0.33589$ ($\sigma = 0.00955$), while the model trained on the original, non-pruned data achieves $\bar{R}^{2} = 0.98796$ ($\sigma = 0.00081$) and $\overline{MAPE} = 0.46895$ ($\sigma = 0.05636$). These scores show that by eliminating the variables with a low influence from the dataset, a higher scoring model is achieved, and the created model achieves a better generalization performance across five folds used for evaluation. Full article
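
The fold-averaged scores reported in this abstract (a mean and a standard deviation per metric) can be computed along the lines of the sketch below. This is an illustration under assumptions: the fold data are made up, MAPE is expressed in percent, and the population standard deviation is used for σ; the abstract does not state which conventions the authors followed.

```python
from statistics import mean, pstdev


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    return 100.0 * mean(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Made-up (y_true, y_pred) pairs standing in for cross-validation folds.
folds = [
    ([100.0, 200.0, 400.0], [110.0, 190.0, 400.0]),
    ([50.0, 100.0, 150.0], [50.0, 105.0, 150.0]),
]

r2_per_fold = [r2_score(t, p) for t, p in folds]
mape_per_fold = [mape(t, p) for t, p in folds]

print(f"mean R2   = {mean(r2_per_fold):.5f}, sigma = {pstdev(r2_per_fold):.5f}")
print(f"mean MAPE = {mean(mape_per_fold):.5f}, sigma = {pstdev(mape_per_fold):.5f}")
```

A small σ alongside a high mean score is what supports the abstract's claim about generalization across folds: the model performs similarly regardless of which fold is held out.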

(This article belongs to the Special Issue Design and Control of Electrical Machines II)

24 pages, 2536 KiB

Open Access | Article

Enhancing Network Intrusion Detection: A Genetic Programming Symbolic Classifier Approach

by Nikola Anđelić and Sandi Baressi Šegota

Information 2024, 15(3), 154; https://doi.org/10.3390/info15030154 - 9 Mar 2024

Cited by 1 | Viewed by 2103

Abstract

This investigation underscores the importance of detecting network intrusions as a key measure to protect digital systems and shield sensitive data from unauthorized access, manipulation, and potential compromise. The principal aim of this study is to use a publicly available dataset and a Genetic Programming Symbolic Classifier (GPSC) to derive symbolic expressions (SEs) capable of highly precise network intrusion detection. To improve the classification precision of the SEs, a Random Hyperparameter Value Search (RHVS) method was developed and implemented to find the optimal combination of GPSC hyperparameter values. The GPSC was trained using five-fold cross-validation, and class imbalance in the initial dataset was mitigated by applying several oversampling techniques to produce balanced dataset variations. After the SEs were obtained, the optimal set was identified based on accuracy, area under the receiver operating characteristic curve, precision, recall, and F1-score. The selected SEs were then tested on the original imbalanced dataset. The results demonstrate the efficacy of the proposed methodology, with the derived symbolic expressions attaining a classification accuracy of 0.9945. Compared with the average state-of-the-art accuracy, this represents an improvement of approximately 3.78%. In summary, this investigation provides insight into the effective use of GPSC and RHVS for network intrusion detection, highlighting their potential for building resilient cybersecurity defenses. Full article

(This article belongs to the Special Issue Advanced Computer and Digital Technologies)

17 pages, 2497 KiB

Open Access | Article

Improvement of Machine Learning-Based Modelling of Container Ship’s Main Particulars with Synthetic Data

by Darin Majnarić, Sandi Baressi Šegota, Nikola Anđelić and Jerolim Andrić

J. Mar. Sci. Eng. 2024, 12(2), 273; https://doi.org/10.3390/jmse12020273 - 2 Feb 2024

Cited by 6 | Viewed by 1729

Abstract

One of the main problems in the application of machine learning techniques is the need for large amounts of data necessary to obtain a well-generalizing model. This is exacerbated for studies in which it is not possible to access large amounts of data, for example, in the case of ship main data modelling, where a limited amount of real-world data (ship main data) is available for dataset creation. In this paper, a synthetic data generation technique has been applied to generate a large amount of synthetic data points regarding container ships’ main particulars. Models are trained using a multilayer perceptron (MLP) regressor on both original and synthetic data mixed with original data points. Then, the authors validate the performance of the obtained models on the original data and conclude whether a synthetic-data-based approach can be used to develop models in instances where the amount of data on ship main particulars may be limited. The results demonstrate an improvement across almost all outputs, ranging between 0.01 and 0.21 when evaluated using the coefficient of determination ($R^{2}$) and between 0.27% and 3.43% when models are evaluated with mean absolute percentage error (MAPE). This indicates that synthetic data can indeed be used to improve ML-based model performance. The presented study demonstrates that ML-based synthetic data generation techniques can provide significant improvements to the process of ML-based determination of a ship’s main particulars at the early design stage. This paper suggests that, in cases where only a small dataset is available, artificial neural networks (ANN) can still be effectively employed to derive early-stage design values for the main particulars through the use of synthetic data. Full article

(This article belongs to the Special Issue Machine Learning and Modeling for Ship Design)

21 pages, 1137 KiB

Open Access | Article

Generating Mathematical Expressions for Estimation of Atomic Coordinates of Carbon Nanotubes Using Genetic Programming Symbolic Regression

by Nikola Anđelić and Sandi Baressi Šegota

Technologies 2023, 11(6), 185; https://doi.org/10.3390/technologies11060185 - 18 Dec 2023

Cited by 1 | Viewed by 2628

Abstract

The study addresses the formidable challenge of calculating atomic coordinates for carbon nanotubes (CNTs) using density functional theory (DFT), a process that can take days. To tackle this issue, the research leverages the Genetic Programming Symbolic Regression (GPSR) method on a publicly available dataset. The primary aim is to assess whether the Mathematical Equations (MEs) resulting from GPSR can accurately estimate the calculated atomic coordinates obtained through DFT. Given the numerous hyperparameters in GPSR, a Random Hyperparameter Value Search (RHVS) method is devised to pinpoint the optimal combination of hyperparameter values, maximizing estimation accuracy. Two distinct approaches are considered. The first involves applying GPSR to estimate the calculated coordinates ($u_{c}$, $v_{c}$, $w_{c}$) using all input variables (initial atomic coordinates $u$, $v$, $w$, and integers $n$, $m$ specifying the chiral vector). The second approach applies GPSR to estimate each calculated atomic coordinate using integers $n$ and $m$ alongside the corresponding initial atomic coordinates. This results in the creation of six different dataset variations. The GPSR algorithm undergoes training via a 5-fold cross-validation process. The evaluation metrics include the coefficient of determination ($R^{2}$), mean absolute error (MAE), root mean squared error (RMSE), and the depth and length of generated MEs. The findings from this approach demonstrate that GPSR can effectively estimate CNT atomic coordinates with high accuracy, as indicated by $R^{2} \approx 1.0$. This study not only contributes to the advancement of accurate estimation techniques for atomic coordinates but also introduces a systematic approach for optimizing hyperparameters in GPSR, showcasing its potential for broader applications in materials science and computational chemistry. Full article

(This article belongs to the Section Innovations in Materials Science and Materials Processing)

19 pages, 1486 KiB

Open Access | Article

Improvement of Malicious Software Detection Accuracy through Genetic Programming Symbolic Classifier with Application of Dataset Oversampling Techniques

by Nikola Anđelić, Sandi Baressi Šegota and Zlatan Car

Computers 2023, 12(12), 242; https://doi.org/10.3390/computers12120242 - 21 Nov 2023

Cited by 5 | Viewed by 2497

Abstract

Malware detection using hybrid features, combining binary and hexadecimal analysis with DLL calls, is crucial for leveraging the strengths of both static and dynamic analysis methods. Artificial intelligence (AI) enhances this process by enabling automated pattern recognition, anomaly detection, and continuous learning, allowing security systems to adapt to evolving threats and identify complex, polymorphic malware that may exhibit varied behaviors. This synergy of hybrid features with AI empowers malware detection systems to efficiently and proactively identify and respond to sophisticated cyber threats in real time. In this paper, the genetic programming symbolic classifier (GPSC) algorithm was applied to a publicly available dataset to obtain symbolic expressions (SEs) that could detect malware with high classification performance. The initial problem with the dataset was a high imbalance between class samples, so various oversampling techniques were utilized to obtain balanced dataset variations on which the GPSC was applied. To find the optimal combination of GPSC hyperparameter values, the random hyperparameter value search (RHVS) method was developed and applied to obtain SEs with high classification accuracy. The GPSC was trained with five-fold cross-validation (5FCV) to obtain a robust set of SEs on each dataset variation. To choose the best SEs, several evaluation metrics were used, i.e., the length and depth of SEs, accuracy score (ACC), area under the receiver operating characteristic curve (AUC), precision, recall, F1-score, and confusion matrix. The best SEs were then applied to the original imbalanced dataset to check whether the classification performance matched that on the balanced dataset variations. The results of the investigation showed that the proposed method generated SEs with high classification accuracy (0.9962) in malware detection. Full article
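
A random hyperparameter value search of the kind this abstract names can be sketched generically as repeated random sampling from predefined ranges, keeping the best-scoring combination. The details of the authors' RHVS method are not given in the abstract, so everything below is an assumption: the hyperparameter names, the ranges, and the toy `evaluate` function (which stands in for training the GPSC with cross-validation and returning a score) are hypothetical.

```python
import random


def random_hyperparameter_search(search_space, evaluate, n_trials, seed=0):
    """Sample `n_trials` random configurations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {}
        for name, (low, high) in search_space.items():
            # Integer bounds give integer samples; otherwise sample uniformly.
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)
            else:
                params[name] = rng.uniform(low, high)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical ranges; a real search would use the classifier's own parameters.
search_space = {
    "population_size": (100, 1000),
    "generations": (50, 200),
    "crossover_probability": (0.5, 1.0),
}


def toy_evaluate(params):
    """Placeholder score; a real run would return a cross-validated metric."""
    return (
        -abs(params["population_size"] - 500) / 500
        - abs(params["crossover_probability"] - 0.9)
    )


best_params, best_score = random_hyperparameter_search(
    search_space, toy_evaluate, n_trials=50
)
print(best_params, round(best_score, 4))
```

Random sampling is a common choice when each evaluation is expensive and the number of hyperparameters makes an exhaustive grid impractical, which fits the abstract's description of searching for an optimal combination of values.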

Search Results (39)
