Article

Automating the Analysis of Negative Test Verdicts: A Future-Forward Approach Supported by Augmented Intelligence Algorithms

by Anna Gnacy-Gajdzik 1,2,3,* and Piotr Przystałka 1

1 Department of Fundamentals of Machinery Design, Silesian University of Technology, 18a Konarskiego Street, 44-100 Gliwice, Poland
2 Doctoral School, Silesian University of Technology, 2a Akademicka Street, 44-100 Gliwice, Poland
3 DIP Draexlmaier Engineering Poland Sp. z o.o., 44-100 Gliwice, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2304; https://doi.org/10.3390/app14062304
Submission received: 31 January 2024 / Revised: 26 February 2024 / Accepted: 6 March 2024 / Published: 9 March 2024

Abstract

In the epoch characterized by the anticipation of autonomous vehicles, the quality, reliability, safety, and security of embedded system software are of major importance. The testing of embedded software is an increasingly significant element of the development process. The application of artificial intelligence (AI) algorithms in the process of testing embedded software in vehicles constitutes a significant area of both research and practical consideration, arising from the escalating complexity of these systems. This paper presents the preliminary development of the AVESYS framework, which facilitates the application of open-source artificial intelligence algorithms in the embedded system testing process. The aim of this work is to evaluate its effectiveness in identifying anomalies in the test environment that could potentially affect testing results. The raw data from the test environment, mainly communication signals and readings from temperature, current, and voltage sensors, are pre-processed and used to train machine learning models. A verification study is carried out, demonstrating the high practical potential of applying AI algorithms in embedded software testing.

1. Introduction

A malfunctioning automotive embedded system can potentially cause serious accidents, endangering the safety of people. Therefore, it is important to detect any defects in the embedded software through various testing processes [1]. Guaranteeing the correct operation of the electronic systems incorporated in the vehicle is highly complex, and their expected functionality must be ensured in extremely adverse environmental conditions. In fact, vehicles are exposed to vibration, noise, extreme temperatures and electromagnetic fields that can affect and degrade electronic components [2]. To test embedded software as quickly and comprehensively as possible, verification and validation of its functionality and quality should start as early as possible. Embedded system vendors aim to automate testing processes in order to reduce the cost and time required for testing. This is especially crucial in the context of the complexity and variety of features being tested. Axelrod points out that effective automation requires control over all data entered into the system that can affect the output values, including data from external systems on which the system depends [3]. Test automation is becoming a key element in the testing process, which consists of three stages: test case generation, test execution, and result analysis [1]. Artificial intelligence is anticipated to impact testing across a wide range of software domains, including but not limited to mobile and web applications, IoT, embedded systems, database applications, the gaming industry, real-time applications, and critical applications [4].
Hourani et al. [5] mention in their article that AI already enhances the software testing process in various aspects. The first step in the testing process in which artificial intelligence algorithms can be involved is the analysis of functional requirements and the automatic generation of test cases. Verma and Beg [6] suggest an approach based on the natural language processing technique, requiring an automation tool and the usage of a database to store the generated graphs. Ansari et al. [7] also propose a system based on NLP which can reduce the effort and time consumed by the software tester by extracting test cases automatically from the Software Requirement Specification. Moghadam et al. use model-free reinforcement learning to construct a self-adaptive autonomous stress-testing structure. This structure is able to obtain the optimal policy for generating stress test cases without having a model of the system under test. The experimental analysis results show that the proposed intelligent structure effectively and adaptively generates stress test conditions for different software systems, even in the absence of performance models [8]. Kikuma et al. introduce a technique for the automatic extraction of homogeneous test cases from requirement specification documents, eliminating dependency on the skills and expertise of the engineer tasked with test case creation [9].
Based on the results of the Khaliq et al. study, it can be concluded that the application of artificial intelligence techniques has significantly improved the test case generation and design process. The authors mention that recent research has mainly focused on areas such as test case generation, prioritization, data generation, and oracle construction. It is worth noting that some software testing activities, including test harness implementation, test technique selection, test repair and vulnerability analysis, are omitted from their manuscript due to the limited availability of AI-based research in these areas [4].
Complex system functionality, changing customer requirements, and evolving regulations, especially in the automotive industry, lead to changes in embedded systems during and after development. This requires repeating the testing process with each software or hardware change, emphasizing the need for test automation to reduce costs, shorten the overall process, and improve efficiency [10].
In addition to the growing number of testing and maintenance tasks, the effort and time required to debug and resolve failed tests must also be considered [11]. One critical aspect of this process is the analysis of negative test verdicts, which has traditionally required significant human intervention and expertise. The traditional approach to analyzing negative test results required testers to meticulously review test results, identify issues and determine root causes. This process, while effective, was time-consuming, resource-intensive and vulnerable to human error. Moreover, as the complexity of systems and software increased, the volume of test data grew exponentially, making manual analysis increasingly impractical.
In this work, we introduce a concept that enables the application of artificial intelligence algorithms to assist embedded system engineers in the process of validating test results. It supports the process of analyzing negative results and identifies cases where the test environment, rather than a software defect, is the cause of test failure. With the advent of artificial intelligence algorithms, the prospect of automating this crucial aspect of testing is now within reach, promising not only to streamline the testing process, but also to improve its overall reliability. The aim of our work is to evaluate the applicability of artificial intelligence algorithms in the process of testing embedded software in a real industrial project. We identify the limitations of the proposed solution and the direction of further work on the implementation of neural networks in the investigated process.
Implementing advanced AI methods and algorithms in mechanical engineering diagnostics offers significant potential for improving maintenance, planning, and process optimization.
Tools for testing embedded software at the unit test and integration test levels are usually closely integrated with development tools. They enable not only the automatic execution of designed test cases, but often also the automatic generation of test cases based on code or software architecture (UML diagrams). Test execution reports can also be automatically generated [12].
Machine learning has emerged as a powerful technique for autonomously learning correct system behavior that meets specified requirements [13]. With the increasing complexity of embedded systems, not all possible interactions can be manually defined as test cases. It is crucial to note that additional input data, when introduced automatically, must align with component behavior patterns to ensure the test corresponds to the system’s actual behavior. Bielefeldt proposes employing deep learning to obtain new, realistic test data and identify new embedded system behavior patterns [14]. Augmented intelligence applied in embedded system testing serves as a robust tool acting as a competent and comprehensive expert system. Considering input data regarding the embedded system and its hardware components, the most suitable set of tests can be suggested to ensure full compliance with product requirements and alert testers to inconsistencies or anomalies in the testing system.
Regarding the analysis of negative test results, it can be mentioned that Mokhtari et al. [15] introduce a system called a Measurement Intrusion Detection System (MIDS) for detecting anomalies in industrial control systems. It utilizes supervised machine learning to classify normal and abnormal activity in the system. Choosing the right anomaly detection algorithm is essential for real-time applications, and feature selection is critical to reducing computation time and improving model accuracy. Mokhtari et al. [15] compare several learning algorithms (KNN, DTC, Random Forest) in the anomaly detection task. Evaluation criteria include classification accuracy and processing time for training and prediction. The Random Forest algorithm demonstrates the best anomaly detection performance and consumes the least time in model generation, while the Decision Tree algorithm has the lowest prediction computation time. The KNN algorithm exhibits lower accuracy in predicting anomalies and requires more time for both training and prediction.
Despite its advantages, supervised learning has limitations that can impact the effectiveness and usability of models trained using this approach. It requires labeled training data, which can be costly, time-consuming, or impractical to obtain. Additionally, the quality and accuracy of labels can affect model performance. Manual labeling can introduce human errors that are challenging to diagnose and correct. Supervised learning models are trained on a specific training dataset, limiting their ability to generalize. The dependence on the representativeness of training data is another consideration when deploying supervised learning models. If the training set does not reflect the full range of possible scenarios in the real environment, the model may be inadequately prepared to handle diverse situations. Sensitivity to overfitting is also a drawback of supervised learning: if a model overfits the training data, it may not perform well on new data.
In the analysis and forecasting of time series data, deep learning methods are preferred for their ability to autonomously identify relevant features during training. Applying machine learning to analyze data from numerous sensors is challenging, since faults should be detected as early as possible, before machine replacement or repair and the resulting system downtime become necessary. Extracting essential features from a large amount of sensor data is a critical challenge. Lu et al. propose a model based on the AE-GRU architecture, where the autoencoder (AE) extracts significant features from raw data, and the Gated Recurrent Unit (GRU) selects information from the sequence for forecasting [16].
Nevertheless, it is important to highlight that there are several additional challenges associated with embedded software testing, where AI can assist and optimize the entire process.
In this study, we aim to investigate the applicability of openly available artificial intelligence algorithms designed to detect outliers in the context of embedded software testing. The objective is to assess their effectiveness in identifying anomalies within the test environment that have the potential to interfere with the results of embedded software testing. The results obtained will be utilized in further work to automate the step of verifying negative test results with artificial intelligence algorithms so that it is cost effective and efficient to implement this in a variety of real industrial automotive projects.

2. Materials and Methods

The authors of this paper propose the AVESYS framework, which is an extension of an embedded system test environment. It includes artificial intelligence algorithms applied to support the automation of the testing process. AVESYS is presented in Figure 1.
AVESYS is an artificial intelligence-based system designed to validate negative test results. It uses artificial intelligence techniques to analyze test results, identify anomalies or failures in the test environment, and validate negative test results. It can enhance the automated validation process using AI algorithms to understand complex system behavior and anomalies within it.
The initial phase of AVESYS involves data collection from the test environment during test execution. Testing of embedded systems is typically performed in a Hardware-in-the-Loop (HIL) environment. Data from real and simulated sensors, data from communication buses and information about the environment’s control signals are collected in a single log file at a predefined sampling rate; for the system under research, this is 1 ms. In the experimental test environment, a vehicle model is used together with a recording of the communication buses captured in an electric vehicle during a 20 min drive around the city. Training data acquisition includes normal system operation (i.e., without executing test cases) and system performance during test case execution.
Subsequent stages such as the preprocessing of these data and the training of the machine learning model are described in detail in Section 2.1 and Section 2.2. The model prepared in this way is next utilized during the execution of embedded system tests to identify potential anomalies present in the environment, leading to false-negative test results. After executing the test cases, the embedded test engineer is informed whether the obtained negative test result is reliable. They consider this information in the process of reporting test results and planning further steps (e.g., revision of the test environment).

2.1. Data Preprocessing

The data collected during the execution of the test case, in the form of a test execution report and a log file containing information about all values and events in the test system, must undergo pre-processing.
The product of the execution of the automated test cases is the test result, which can be positive or negative. The test execution process is documented in an HTML file containing all test steps preceded by timestamps. Each assertion in the test is also accompanied by a verdict indicating whether it is positive or negative. A test is considered negative if at least one of its assertions produces a negative result. There is the possibility of recording the behavior of the test environment in the form of a log file, collecting data about all variable quantities in the system. It contains information about the communication on all buses in the system, varying values of system variables used for simulation control, and dynamic changes in physical values throughout the system (both real and simulated), such as current, voltage, and temperature.
Due to the implementation-oriented nature of the research, the preliminary processing activities are reduced to a minimum and consist primarily of selecting the relevant features for a given test case. The basis for feature extraction is derived from the HTML report file, which contains information about the subsequent steps of the executed test. Using scripts written in Python 3.11, the data from the log file are split into groups (e.g., communication-related data, voltage measurements, current measurements) and loaded into TXT files. In the data cleaning procedure, irrelevant columns are removed and all data formatting issues are corrected. The log entries are then analyzed in the data parsing procedure and relevant information is distinguished based on the test execution report. The log messages are converted into structured fields based on predefined patterns (e.g., for communication signals, current, and voltage values). In the process of feature engineering, additional information is extracted, e.g., about the duration of specific system states. Categorical variables are transformed into numerical representations using the label encoding technique. In the data transformation stage, numerical features are normalized. Information from the communication buses, where the presence of a message rather than its value is important, is aggregated to reduce the amount of data.
The preprocessed data are split into learning (training and testing) and validation data. In machine learning modeling, particularly for classification problems, having a balanced dataset during the training phase is crucial for model effectiveness. Data balancing techniques like the Synthetic Minority Oversampling Technique (SMOTE), introduced by Chawla et al. [17], help improve classifier performance. If data balancing issues are identified, the technique is applied accordingly. The preprocessed data are then saved in CSV format to train the model.
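To make this pipeline concrete, a minimal sketch of the preprocessing chain is given below, assuming pandas, scikit-learn and imbalanced-learn as the tooling; the file names, column names and the state-duration feature are illustrative placeholders rather than the project’s actual log schema.

```python
# Minimal sketch of the preprocessing chain described above.
# File names, column names and the state-duration feature are
# illustrative placeholders, not the actual log schema of the project.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from imblearn.over_sampling import SMOTE

# Load one pre-split group of log data (e.g., voltage measurements).
df = pd.read_csv("voltage_group.txt", sep="\t")

# Data cleaning: drop irrelevant columns and rows with formatting problems.
df = df.drop(columns=["raw_message"], errors="ignore").dropna()

# Feature engineering: e.g., duration between consecutive log entries.
df["state_duration_ms"] = df["timestamp_ms"].diff().fillna(0)

# Label encoding of categorical variables (e.g., the reported system state).
df["system_state"] = LabelEncoder().fit_transform(df["system_state"])

# Data transformation: normalize the numerical features.
num_cols = ["voltage", "state_duration_ms"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Split into learning data (train/test); "is_anomaly" is the target label.
X = df.drop(columns=["is_anomaly"])
y = df["is_anomaly"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Re-balance the training split with SMOTE if the classes are skewed.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Persist the preprocessed data for model training.
pd.concat([X_train, y_train], axis=1).to_csv("train_preprocessed.csv", index=False)
```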

2.2. Applied AI Algorithms

To address the challenge of automating the analysis of test results for embedded systems, the Python language was selected due to its widespread availability, open-source nature, and rich collection of tools and libraries dedicated to artificial intelligence. This decision is primarily explained by the vast access to validated algorithms whose effectiveness has been confirmed in numerous scientific articles.
For the exploration and evaluation of algorithms dedicated to anomaly detection, the Python Outlier Detection (PyOD) library [18] is used. This library stands out for its extensive implementations of various anomaly detection algorithms. Its application is particularly justified in the context of the ongoing project, enabling efficient analysis and identification of anomalies in test data.
Based on benchmark results presented by Han et al. [19], it is decided to choose 7 unsupervised (ECOD, PCA, AvgKNN, IForest, INNE, AutoEncoder, ALAD) and 7 supervised (CatB+, LGB, MLP, SVM, XGB+, ResNet, FTTransformer) algorithms for the evaluation of the proposed AVESYS framework.
The ECOD algorithm proposed by Li et al. [20] is based on estimating the underlying distribution of the input data in a non-parametric way, calculating the empirical cumulative distribution for each dimension of the data. It then uses these distributions to estimate the tail probabilities for each dimension for each data point. Finally, ECOD calculates an outlier score for each data point by aggregating the estimated tail probabilities across dimensions.
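The scoring idea can be illustrated with a few lines of NumPy. The simplified sketch below aggregates, for each dimension, the smaller of the left and right empirical tail probabilities; it omits the skewness-based correction and the separate aggregation variants of the published method.

```python
# Simplified, illustrative version of the ECOD scoring idea: per-dimension
# empirical tail probabilities are aggregated into an outlier score. The
# published algorithm additionally uses a skewness correction, omitted here.
import numpy as np

def ecod_like_scores(X: np.ndarray) -> np.ndarray:
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        col_sorted = np.sort(col)
        # Empirical CDF evaluated at each sample (left and right tails).
        left_tail = np.searchsorted(col_sorted, col, side="right") / n
        right_tail = 1.0 - np.searchsorted(col_sorted, col, side="left") / n
        tail = np.maximum(np.minimum(left_tail, right_tail), 1.0 / n)
        # Negative log tail probability: large when the point is extreme.
        scores += -np.log(tail)
    return scores  # higher score = more outlying
```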
The PCA algorithm is applied because of its efficacy in uncovering the underlying pattern and reduced computational complexity. The research uses the approach described by Shyu et al. [21].
The performance of the classic KNN (K-Nearest Neighbors) algorithm used for classification and regression tasks is investigated. The AvgKNN algorithm efficiently computes the weights using a Hilbert space-filling curve. It operates in two phases: the first phase provides an approximate solution with low time complexity, and the second phase refines the solution with exact results in a single scan [22]. Another algorithm designed for anomaly detection is Isolation Forest (IForest). It is developed based on the principle that anomalies are easier to isolate in a dataset. IForest builds a random forest of decision trees to effectively identify outliers. The algorithm works by randomly selecting features and creating paths in the trees to isolate individual data points. Anomalies, which are rarer, are expected to require fewer splits to isolate. The IForest algorithm assigns anomaly scores based on how quickly instances are isolated in the trees: instances that are isolated faster, i.e., with shorter paths, receive higher anomaly scores, making the algorithm effective in identifying outliers in multidimensional datasets. It is particularly useful in scenarios where anomalies are rare and show patterns different from those of normal instances [23]. Bandaragoda et al. [24] presented the INNE algorithm, which uses an alternative isolation mechanism based on a Nearest Neighbor Ensemble. Although INNE is based on nearest neighbors, it performs much faster than similar existing methods, especially in datasets with thousands of dimensions or millions of instances. This is because the proposed method has linear time complexity and constant space complexity.
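A hedged sketch of how these unsupervised detectors can be instantiated and scored through the PyOD API is given below; default hyperparameters are used rather than the tuned settings of the study, and the AutoEncoder and ALAD models are left out because their constructor arguments vary between PyOD releases.

```python
# Hedged sketch: evaluating a subset of the selected unsupervised PyOD
# detectors on labeled benchmark data with default hyperparameters.
from pyod.models.ecod import ECOD
from pyod.models.pca import PCA
from pyod.models.knn import KNN          # method="mean" corresponds to AvgKNN
from pyod.models.iforest import IForest
from pyod.models.inne import INNE
from sklearn.metrics import mean_squared_error, precision_score, roc_auc_score

DETECTORS = {
    "ECOD": ECOD(),
    "PCA": PCA(),
    "AvgKNN": KNN(method="mean"),
    "IForest": IForest(random_state=42),
    "INNE": INNE(),
}

def evaluate_unsupervised(X_train, X_test, y_test):
    for name, model in DETECTORS.items():
        model.fit(X_train)                         # labels are not used
        scores = model.decision_function(X_test)   # raw outlier scores
        labels = model.predict(X_test)             # 0 = inlier, 1 = outlier
        print(f"{name}: ROC AUC={roc_auc_score(y_test, scores):.4f} "
              f"Precision={precision_score(y_test, labels):.4f} "
              f"MSE={mean_squared_error(y_test, labels):.4f}")
```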
The autoencoder is chosen for this study because of its effective representation of complex data. The autoencoder consists of an encoder, which compresses the input data into a lower-dimensional representation, and a decoder, which is responsible for reconstructing the original input data. The aim of using an autoencoder is to minimize the reconstruction error, encouraging the network to learn a meaningful representation of the data. The architecture of the autoencoder is symmetric, with the bottleneck layer of the encoder holding the encoding of the input data in a lower dimension [25]. Zenati et al. [26] proposed an approach to anomaly detection based on a cyclically consistent GAN architecture for sequential data and a new way to measure reconstruction error; this approach is hereafter referred to as the ALAD (Adversarial Learning for Time Series) algorithm. In this study, we also consider the XGBoost (XGB+) algorithm, widely used by researchers to handle sparse data. The most important distinguishing factor of this algorithm is its scalability in all scenarios due to several significant optimizations, including a novel tree learning algorithm for handling sparse data and a quantile sketch procedure to handle instance weights in approximate tree learning [27]. Prokhorenkova et al. [28] proposed an innovative approach to gradient boosting, CatB+. The toolkit they implemented combines ordered boosting, a permutation-driven alternative to the classical algorithm, and an innovative algorithm for processing categorical features. Both techniques are developed to combat the prediction shift caused by a special type of target leakage present in all currently existing implementations of gradient boosting algorithms.
Another innovative approach is the LightGBM (LGB) algorithm based on two new techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). GOSS is based on excluding a significant proportion of instances with small gradients and using the rest to estimate the information gain. EFB combines mutually exclusive features to reduce the number of features [29].
Considering that Orrù et al. [30] successfully validated the learning capabilities of two different algorithms, the Support Vector Machine (SVM) and the Multilayer Perceptron (MLP), for potential fault recognition in the oil and gas industry, we include these two algorithms in our analyses.
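As an illustration of how these classical supervised models can be trained on the preprocessed, labeled log data, a hedged sketch is given below; scikit-learn, XGBoost, LightGBM and CatBoost are assumed as the implementations, with default hyperparameters rather than the values tuned in the study, and the deep tabular models are omitted because they require a separate deep-learning stack.

```python
# Sketch of the training and evaluation loop for the classical supervised
# models discussed above, using library defaults for the hyperparameters.
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import precision_score, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

MODELS = {
    "MLP": MLPClassifier(max_iter=500, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "XGB+": XGBClassifier(eval_metric="logloss", random_state=42),
    "LGB": LGBMClassifier(random_state=42),
    "CatB+": CatBoostClassifier(verbose=0, random_state=42),
}

def evaluate_supervised(X_train, y_train, X_test, y_test):
    for name, model in MODELS.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]   # predicted anomaly probability
        preds = model.predict(X_test)
        print(f"{name}: ROC AUC={roc_auc_score(y_test, proba):.4f} "
              f"Precision={precision_score(y_test, preds):.4f}")
```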
Two simple models for tabular data proposed by Gorishniy et al. are also investigated: the modified ResNet (Residual Network) architecture and the FTTransformer (Feature Tokenizer + Transformer). The main innovation in ResNet is the use of residual blocks. Traditional deep networks suffer from the problem of the vanishing gradient, making it difficult to train very deep networks. Residual blocks use skip connections, or shortcuts, to skip one or more layers, allowing the gradient to flow more easily through the network. In the adaptation, the main building block is simplified compared to the original architecture and there is an almost clean path from input to output, which is considered beneficial for optimization. In a nutshell, the FTTransformer model transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings. Thus, every Transformer layer operates on the feature level of one object [31].

3. Results and Discussion

3.1. Experimental Setup

In Figure 2, a schematic of the experimental setup is shown.
This setup is based on a Hardware in the Loop (HIL) simulation containing the embedded Device Under Test (DUT). In this case, it is a high-voltage battery controller including real and simulated sensors and actuators. The electrically simulated sensors and actuators in the HIL configuration emulate the behavior of the real counterparts, enabling supervised testing of the DUT. The physical sensors and actuators are part of the real configuration, providing real input and output signals to and from the DUT, enabling realistic test scenarios.
The HIL simulation is controlled from a PC, using dedicated software. For the simulation, a virtual representation of the vehicle is created, as a model based on physical phenomena, which simulates the behavior of the real vehicle (its dynamics, control systems and the vehicle’s reactions to various inputs). This makes it possible to assess the functionality and performance of the DUT in a controlled environment.
The setup is equipped with a software logger, which is a data acquisition system that records and stores information generated during the testing process in the test environment. It captures signals from sensors, actuators and DUTs to analyze system behavior and identify potential problems. During the execution of test cases, the test oracle assesses the correctness of the results or behavior of the DUT against the expected results. It serves as a benchmark for evaluating the success or failure of tests, helping to identify deviations from expected behavior. In the case of a negative test result, it is the task of the AVESYS framework to validate it using a previously trained neural network.
The objective of the conducted experiment is to evaluate the performance and quality of the different artificial intelligence algorithms used in the proposed AVESYS framework.

3.2. Experiment Dataset

The organization of the data used in the experiment is shown in Figure 2. Learning data are understood to be the data used in the process of training artificial neural networks and in the evaluation of supervised and unsupervised algorithms. The data are divided into train data (90% of the dataset) and test data (10% of the dataset). In cross-validation, the entire dataset of learning data was used as train data, and new, unseen data called verification data were taken as test data.
Data from two sources were utilized during the research:
  • The ODDS dataset
    This open-access dataset was chosen to validate the performance and assess the quality of the artificial intelligence algorithms available in the PyOD library as a first step of the research. The Outlier Detection Datasets (ODDS) library [32] provides extensive outlier detection datasets collected from different domains. The selected dataset has a size, a number of features, and a percentage of anomalies similar to the data used in the current research.
    The advantage was that the data were already labeled, so the performance of unsupervised and supervised algorithms could be benchmarked without additional effort (a minimal loading sketch is given after this list).
  • Real embedded test dataset
    This term refers to data collected by the logger from the real Embedded System Under Test during the execution of the specified test cases on the Device Under Test. The raw data from the log file have to undergo preprocessing before being forwarded to the artificial intelligence algorithms.
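As referenced in the first list item, a minimal sketch of loading an ODDS benchmark and producing the 90%/10% learning/test split might look as follows; the file name is a placeholder, and the assumption is the usual ODDS packaging of a MATLAB .mat file with a feature matrix "X" and a label vector "y" (1 marks an outlier).

```python
# Minimal loading sketch for an ODDS benchmark file (placeholder name).
from scipy.io import loadmat
from sklearn.model_selection import train_test_split

mat = loadmat("odds_dataset.mat")          # placeholder file name
X, y = mat["X"], mat["y"].ravel()          # y: 1 = outlier, 0 = inlier

# 90% learning data / 10% test data, as in the experiments described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)
```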

3.3. Experimental Results

The first stage of the study focuses on the analysis of unsupervised algorithms. In the first step, to validate their performance and assess their quality, the algorithms are tested on the ODDS dataset.
Results of this stage are presented in Table 1 and Table 2.
In general, the evaluation of multiple anomaly detection algorithms on both training and test datasets reveals varying degrees of effectiveness across different metrics. The results highlight the diversity in performance, emphasizing the importance of selecting an algorithm based on specific priorities and the nature of the data. It is noteworthy that the Isolation Forest (IForest) algorithm stands out for its robust performance, consistently achieving high Roc_Auc_Score and precision values and a low mean squared error (MSE). Other algorithms, such as ECOD, PCA and INNE, also show commendable performance across multiple metrics.
The next stage of the experimentation is to implement the aforementioned algorithms using the best configured hyperparameter settings on a real embedded test dataset. Results are shown in Table 3 and Table 4.
Disappointingly, the obtained results are unsatisfactory. Modifications to the hyperparameters did not result in an improved performance of anomaly detection in the logs from the real embedded systems tested.
The first stage of the study is therefore repeated for the supervised algorithms. Results of this stage are presented in Table 5 and Table 6.
The obtained excellent ROC AUC results (1.0) with zero mean squared error for the CatB+, LGB, XGB+, ResNet and FTTransformer algorithms demonstrate the high performance of anomaly detection in the tested ODDS dataset for supervised algorithms. An evaluation of these algorithms is also performed for test data obtained from real embedded systems. Results are presented in Table 7 and Table 8.
To summarize the obtained results, some algorithms perform exceptionally well on the training set, but there are some indications of potential over-fitting (LGB, CatB+) considering the performance of these algorithms on the test data. The decrease in performance observed for several models when moving from the learning set to the test set suggests that some models may have difficulties with generalizing to the new, unseen data. Mean squared error (MSE) values provide insight into the accuracy of regression predictions. In general, low MSE values indicate accurate predictions, and values are relatively consistent across algorithms. The SVM algorithm consistently achieves excellent performance in both Roc_Auc_Score and precision on both the training and test datasets, implying robust generalization capabilities. MSE values are consistently low, indicating accurate predictions. Similar results are reported for the MLP algorithm, but its lower performance on the training set suggests potential over-fitting or high sensitivity to the training data.
In the next step of this study, the K-fold cross-validation technique is employed, entailing the partitioning of the learning dataset into K equally sized segments. K-1 subsets of data are utilized for model training (train data), while the remaining subset serves for validation (test data). This technique involves K iterations, each time reserving a specified subset for validation. Exclusion of training samples from those used to evaluate candidate parameter values reduces the likelihood of overfitting, thereby enhancing the generalization of the classifier [33].
The real embedded test dataset is divided into 10 subsets, and the model training and testing processes are repeated 10 times. Subsequently, the average values of the parameters determining the model’s quality are computed. The dataset contains 231,880 samples.
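A minimal sketch of this 10-fold procedure is given below, assuming NumPy arrays and any scikit-learn-compatible classifier from Section 2.2; the per-fold confusion matrices are summed and the fold metrics averaged, mirroring how the reported values are aggregated.

```python
# Sketch of the 10-fold evaluation loop: per-fold confusion matrices are
# summed and the quality indicators averaged over the folds.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def kfold_report(model, X, y, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    total_cm = np.zeros((2, 2), dtype=int)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        fold_model = clone(model)                    # fresh copy per fold
        fold_model.fit(X[train_idx], y[train_idx])
        preds = fold_model.predict(X[test_idx])      # evaluate on held-out fold
        total_cm += confusion_matrix(y[test_idx], preds, labels=[0, 1])
        aucs.append(roc_auc_score(y[test_idx], preds))
    tn, fp, fn, tp = total_cm.ravel()
    print("Summed confusion matrix:", "Tn", tn, "Fp", fp, "Fn", fn, "Tp", tp)
    print("Mean ROC AUC over folds:", round(float(np.mean(aucs)), 4))
```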
For the examined algorithms, confusion matrices are determined for both training and testing data, calculated as the sum of the values Tn, Fp, Fn, and Tp across all trials. Results are shown in Table 9 and Table 10.
The SVM algorithm and XGB+ achieved a high count of true negatives (Tn) and true positives (Tp) and a relatively low count of false positives (Fp) and false negatives (Fn) on both training and test data, indicating effective classification of negative instances and the ability of the model to effectively classify negative instances in unseen data. In summary, the algorithms generally show high performance in classifying both negative and positive instances, with some differences in the balance between true and false positives.
The cross-validation results, after computing the average values of the quality indicators, are presented in Table 11 and Table 12. The research results demonstrate the performance of five different algorithms: MLP, SVM, XGB+, LGB, and CatB+ across various evaluation metrics, including Roc_Auc_Score, precision, R2 Score, Matthews Correlation Coefficient (MCC), and Balanced Accuracy (BA).
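For reference, the quality indicators listed above (together with the MSE and MSLE reported later) could be computed per fold with scikit-learn as in the hedged sketch below; y_true, y_pred and y_proba are assumed to be the fold's true labels, hard predictions and predicted anomaly probabilities.

```python
# Sketch: computing the reported quality indicators for one fold.
from sklearn.metrics import (balanced_accuracy_score, matthews_corrcoef,
                             mean_squared_error, mean_squared_log_error,
                             precision_score, r2_score, roc_auc_score)

def quality_indicators(y_true, y_pred, y_proba):
    return {
        "Roc_Auc_Score": roc_auc_score(y_true, y_proba),
        "Precision": precision_score(y_true, y_pred),
        "R2_Score": r2_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "BA": balanced_accuracy_score(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "MSLE": mean_squared_log_error(y_true, y_pred),
    }
```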
All algorithms demonstrate excellent discriminatory power, with Roc_Auc_Score close to or equal to 1.0000 for both training and test data. The SVM, XGB+, and CatB+ algorithms show perfect Roc_Auc_Score, indicating optimal performance in distinguishing between classes. For all algorithms, the precision values are consistently high for both training and test data, reflecting the ability of the models to correctly identify positive instances. SVM, XGB+, and CatB+ algorithms stand out with precision values approaching or equal to 1.0000 on the test data. R2 scores, indicating the percentage of variance explained by the models, are generally high for all algorithms on the test data. XGB+ demonstrates perfect R2 scores, suggesting an excellent fit to the data. The MCC values, which measure the quality of binary classifications, are consistently high across algorithms and datasets. The XGB+ and SVM algorithms show particularly high MCC scores, indicating reliable classification performance. Balanced accuracy scores are consistently high, indicating a well-balanced performance between sensitivity and specificity for all algorithms. The XGB+ and SVM algorithms show the highest balanced accuracy scores on the test data.
The performance of the various algorithms is also assessed through the Mean Squared Error (MSE) and the Mean Squared Logarithmic Error (MSLE), which are presented in Table 13 and Table 14. Each algorithm’s ability to minimize these metrics provides insight into its effectiveness in capturing the underlying patterns within the dataset.
Multilayer Perceptron (MLP) demonstrates competitive performance, with an MSE of 0.0082 and an MSLE of 0.0035. These values indicate reasonable precision of the model, although further optimization may be explored to potentially enhance its performance. Support Vector Machine (SVM) outperforms other models in terms of precision, exhibiting the lowest MSE (0.0008) and MSLE (0.0004). This suggests that SVM is highly effective in minimizing errors and capturing variability within the data. Ensemble models, represented by XGB+, LGB, and CatB+, consistently outperform individual models, presenting MSE and MSLE values that are significantly lower than the MLP. This highlights the effectiveness of ensemble methods in improving predictive accuracy. The lower MSLE values across all algorithms compared to MSE values suggest that the models perform well across different magnitudes of predictions. This is particularly important in scenarios where predictions cover a wide range of values. The varying performance of different algorithms indicates sensitivity to algorithmic choices. Understanding the strengths and weaknesses of each algorithm is crucial for selecting the most suitable model based on the specific requirements of the task.
In summary, the evaluated algorithms consistently perform well across a range of metrics, showcasing their effectiveness in handling the classification task. The choice of the most suitable algorithm may depend on specific priorities, such as precision, interpretability, or computational efficiency, as indicated by the different strengths observed in the metrics.

3.4. Cross-Validation

After fitting the model to the training data, it is imperative to verify whether the trained model also performs adequately when exposed to real-world data. It is essential to confirm that the model is well acquainted with patterns within the data and does not exhibit excessive noise. The performance of the machine learning model was assessed through cross-validation. This involved training the model on a subset of input data and testing it on an unseen dataset [34].
A cross-validation of the investigated algorithms was performed using the entire dataset used in previous studies: the real embedded test dataset as the train data, and test data representing new, unseen data collected in the form of logs from the real test system while performing a test case distinct from the preceding cases. In this case, the training dataset contains 231,880 samples (including 24,081 outliers, which is 10%), while the test dataset contains 146,178 samples (including 12,040 outliers, which is 10% as well). The results of the algorithm evaluation are presented in the form of quality indicators in Table 15 and Table 16, MSE and MSLE in Table 17 and Table 18, and the confusion matrices in Table 19 and Table 20.
The MLP, SVM, XGB+, LGB and CatB+ algorithms sustained high performance on the test data, achieving high scores in ROC AUC, precision, R2 score, MCC, and balanced accuracy (BA), although there were slight performance decreases compared to the training set. ResNet performance remained non-optimal on the test set, highlighting the challenges in generalization and in capturing the data nuances. FTTransformer showed mixed results, with improvements in some metrics, but still lagged behind the other algorithms.
MLP, SVM, XGB+, LGB, and CatB+ consistently performed well in terms of regression metrics on both training and test datasets, suggesting their reliability for regression tasks. The choice of algorithm may depend on the specific requirements of the application, considering the trade-off between computational efficiency and the magnitude of regression errors.
MLP, SVM, XGB, LGB, and CatB+ algorithms demonstrated strong performance on the train data, with high true positive (Tp) counts and relatively low false positive (Fp) and false negative (Fn) counts. SVM and XGB stood out with perfect true negative (Tn) counts, indicating precise classification of negative instances. ResNet and FTTransformer showed distinctive characteristics. ResNet had a perfect true positive count but misclassified all negative instances, resulting in high false negative and low true negative counts. FTTransformer exhibited a higher false positive count, indicating challenges in distinguishing negative instances. MLP, SVM, XGB, LGB, and CatB+ algorithms continued to demonstrate good performance on the test data, with high Tp counts and relatively low Fp and Fn counts. ResNet performed poorly on the test set, misclassifying all positive instances, leading to high false negative counts. FTTransformer also faced challenges, particularly with a high false positive count, suggesting difficulties in distinguishing negative instances. The choice of the most suitable algorithm should align with the specific needs and constraints of the classification task.
In summary, while certain algorithms demonstrated strong performance across various metrics, the ability to generalize to new data varied. The SVM and MLP algorithms were outstanding as highly reliable models on both training and test datasets, making them strong candidates for practical applications.

4. Conclusions

The aim of this research was to design and test a method to integrate artificial intelligence algorithms into the embedded software testing process at the phase of verification and validation of the results. This phase represents a critical constraint, requiring the involvement of specialized experts. The authors are convinced that by applying artificial intelligence algorithms, it is feasible to optimize this process and significantly shorten it. This is of great importance for real industrial projects, which are very often developed using agile methodologies and therefore allow only minimal time intervals for the various stages.
The authors in their study examined the behavior of both unsupervised and supervised algorithms. From an industrial point of view, it appeared that unsupervised algorithms would be the preferred solution. However, the research carried out on data collected from a real test system showed that adjusting the parameters of the algorithms to ensure that the results obtained from them are sufficient requires time and the involvement of an artificial intelligence expert for each embedded system under test. From an implementation perspective, this limitation makes it a solution that does not provide the expected optimization of costs and human resources.
In this contribution, we also presented the experimental results of several selected supervised algorithms on the same data from a real test system.
The best results were obtained for the MLP and SVM algorithms. A low MSE value was observed for them (0.4% for MLP and 0.2% for SVM), demonstrating the high accuracy of the regression predictions. The lower performance of the MLP algorithm on the training set suggests potential over-fitting or high sensitivity to the training data. The SVM algorithm, which is resistant to overfitting because it minimizes the length of the weight vector, performed better in this case. Analysis of the confusion matrices showed that the SVM algorithm demonstrates the most efficient classification of negative instances and the ability to successfully classify negative instances in unseen data.
The application of supervised algorithms requires more involvement in the processing of the training data, which has to be labeled. However, this can be performed by the same engineer who develops the test cases. Knowledge or experience of artificial intelligence is not required, only familiarity with the functioning of the test system itself and the test environment. This is a necessary stage that can easily be automated by creating an expertise database of rules for specific groups of projects.
The benefits of automating the analysis of negative test verdicts are manifold. First, it accelerates the testing process, enabling earlier feedback and faster debugging. This, in turn, shortens product development cycles and release times. Second, it mitigates the risk of human error, increasing the reliability and consistency of test results. Third, it optimizes resource allocation, allowing testing teams to focus their expertise on more complex and strategic tasks. Finally, it supports a data-driven approach to testing, utilizing the power of big data analytics to uncover insights that might otherwise remain hidden.

Author Contributions

Conceptualization, A.G.-G. and P.P.; methodology, A.G.-G.; software, A.G.-G.; validation, A.G.-G. and P.P.; formal analysis, A.G.-G.; investigation, A.G.-G.; resources, A.G.-G.; data curation, A.G.-G.; writing—original draft preparation, A.G.-G.; writing—review and editing, P.P.; visualization, A.G.-G.; supervision, P.P.; project administration, A.G.-G.; funding acquisition, P.P. All authors have read and agreed to the published version of the manuscript.

Funding

Research supported by the Polish Ministry of Education and Science Grant No DWD/4/55/2020. This publication is supported by the statutory funds of the Department of Fundamentals of Machinery Design.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from DIP Draexlmaier Engineering Poland Sp. z o.o. (Gliwice, Poland) and are available from the corresponding author with the permission of Draexlmaier.

Conflicts of Interest

Author A.G.-G. was employed by the company DIP Draexlmaier Engineering Poland Sp. z o.o. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AE: Autoencoder
AI: Artificial Intelligence
ALAD: Adversarial Learning for Time Series
ASPICE: Automotive Software Process Improvement and Capability Determination
AVESYS: Automated Verification System
BA: Balanced Accuracy
CatB+: Unbiased Categorical Boosting with categorical features
DTC: Decision Tree Classifier
DUT: Device Under Test
ECOD: Empirical Cumulative Distribution Functions for Outlier Detection
EFB: Exclusive Feature Bundling
Fn: False Negative
Fp: False Positive
FTTransformer: Feature Tokenizer + Transformer
GAN: Generative Adversarial Networks
GOSS: Gradient-based One-Side Sampling
GRU: Gated Recurrent Unit
HIL: Hardware in the Loop
HTML: HyperText Markup Language
INNE: Nearest Neighbor Ensemble
IForest: Isolation Forest
IoT: Internet of Things
KNN: K-Nearest Neighbors
LGB: LightGBM (Light Gradient Boosting)
MCC: Matthews Correlation Coefficient
MIDS: Measurement Intrusion Detection System
ML: Machine Learning
MLP: Multilayer Perceptron
MSE: Mean Squared Error
MSLE: Mean Squared Logarithmic Error
NLP: Natural Language Processing
ODDS: Outlier Detection Datasets
PCA: Principal Component Analysis
PyOD: Python Outlier Detection
ResNet: Residual Network
R2 score: Coefficient of determination regression score function
ROC AUC: Area Under the Curve of the Receiver Operating Characteristic
SMOTE: Synthetic Minority Oversampling Technique
SVM: Support Vector Machine
Tp: True Positive
Tn: True Negative
UML: Unified Modeling Language
XGB+: Scalable end-to-end tree boosting system

References

  1. Kum, D.; Son, J.; Lee, S.; Wilson, I. Automated Testing for Automotive Embedded Systems. In Proceedings of the 2006 SICE-ICASE International Joint Conference, Busan, Republic of Korea, 18–21 October 2006; pp. 4414–4418. [Google Scholar]
  2. Placho, T.; Schmittner, C.; Bonitz, A.; Wana, O. Management of automotive software updates. Microprocess. Microsyst. 2020, 78, 103257. [Google Scholar] [CrossRef]
  3. Axelrod, A. Complete Guide to Test Automation; Apress: New York, NY, USA, 2018. [Google Scholar]
  4. Khaliq, Z.; Farooq, S.U.; Khan, D.A. Artificial Intelligence in Software Testing: Impact, Problems, Challenges and Prospect. arXiv 2022, arXiv:2201.05371. [Google Scholar]
  5. Hourani, H.; Hammad, A.; Lafi, M. The Impact of Artificial Intelligence on Software Testing. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019; pp. 565–570. [Google Scholar] [CrossRef]
  6. Verma, R.P.; Beg, M.R. Generation of Test Cases from Software Requirements Using Natural Language Processing. In Proceedings of the 2013 6th International Conference on Emerging Trends in Engineering and Technology, Nagpur, India, 16–18 December 2013. [Google Scholar]
  7. Ansari, A.; Shagufta, M.; Fatima, A.; Tehreem, S. Constructing Test cases using Natural Language Processing. In Proceedings of the 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), Chennai, India, 27–28 February 2017. [Google Scholar]
  8. Helali Moghadam, M. Machine Learning-Assisted Performance Testing. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019. [Google Scholar] [CrossRef]
  9. Kikuma, K.; Yamada, T.; Sato, K.; Ueda, K. Preparation Method in Automated Test Case Generation using Machine Learning. In Proceedings of the 10th International Symposium on Information and Communication Technology, Halong Bay, Vietnam, 4–6 December 2019; pp. 393–398. [Google Scholar] [CrossRef]
  10. Raikwar, S.; Jijyabhau Wani, L.; Arun Kumar, S.; Sreenivasulu Rao, M. Hardware-in-the-Loop test automation of embedded systems for agricultural tractors. Measurement 2019, 133, 271–280. [Google Scholar] [CrossRef]
  11. Battina, D.S. Artificial Intelligence in Software Test Automation: A Systematic Literature Review. Int. J. Emerg. Technol. Innov. Res. 2019, 6, 1329–1332. Available online: https://www.jetir.org/papers/JETIR1912176.pdf (accessed on 5 March 2024).
  12. Bajer, M.; Szlagor, M.; Wrzesniak, M. Embedded software testing in research environment. A practical guide for non-experts. In Proceedings of the 2015 4th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 14–18 June 2015; pp. 100–105. [Google Scholar] [CrossRef]
  13. Cordeiro, L.C. Automated Verification and Synthesis of Embedded Systems using Machine Learning. arXiv 2017, arXiv:1702.07847. [Google Scholar]
  14. Bielefeldt, J.; Kai-Uwe, B.; Reza Khan, S.; Massah, M.; Hans-Werner, W.; Scharoba, S.; Hübner, M. DeepTest: How Machine Learning Can Improve the Test of Embedded Systems. In Proceedings of the 2021 10th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 7–10 June 2021; pp. 1–6. [Google Scholar] [CrossRef]
  15. Mokhtari, S.; Abbaspour, A.; Yen, K.K.; Sargolzaei, A. A Machine Learning Approach for Anomaly Detection in Industrial Control Systems Based on Measurement Data. Electronics 2021, 10, 407. [Google Scholar] [CrossRef]
  16. Lu, Y.W.; Hsu, C.Y.; Huang, K.C. An Autoencoder Gated Recurrent Unit for Remaining Useful Life Prediction. Processes 2020, 8, 1155. [Google Scholar] [CrossRef]
  17. Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR) 2002, 16, 321–357. [Google Scholar] [CrossRef]
  18. Zhao, Y.; Nasrullah, Z.; Li, Z. PyOD: A Python Toolbox for Scalable Outlier Detection. J. Mach. Learn. Res. 2019, 20, 1–7. [Google Scholar]
  19. Han, S.; Hu, X.; Huang, H.; Jiang, M.; Zhao, Y. Adbench: Anomaly detection benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 32142–32159. [Google Scholar] [CrossRef]
  20. Li, Z.; Zhao, Y.; Hu, X.; Botta, N.; Ionescu, C.; Chen, G.H. ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. IEEE Trans. Knowl. Data Eng. 2023, 35, 12181–12193. [Google Scholar] [CrossRef]
  21. Shyu, M.L.; Chen, S.C.; Sarinnapakorn, K.; Chang, L. A Novel Anomaly Detection Scheme Based on Principal Component Classifier. In Proceedings of the International Conference on Data Mining, San Francisco, CA, USA, 1–3 May 2003. [Google Scholar]
  22. Angiulli, F.; Pizzuti, C. Fast Outlier Detection in High Dimensional Spaces. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Helsinki, Finland, 19–23 August 2002. [Google Scholar]
  23. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  24. Bandaragoda, T.R.; Ting, K.M.; Albrecht, D.; Liu, F.T.; Zhu, Y.; Wells, J.R. Isolation-based anomaly detection using nearest-neighbor ensembles. Comput. Intell. 2018, 34, 968–998. [Google Scholar] [CrossRef]
  25. Aggarwal, C.C. Outlier Analysis; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  26. Zenati, H.; Romain, M.; Foo, C.S.; Lecouat, B.; Chandrasekhar, V.R. Adversarially Learned Anomaly Detection. arXiv 2018, arXiv:1812.02288. [Google Scholar]
  27. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
  28. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. arXiv 2019, arXiv:1706.09516. [Google Scholar]
  29. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 3149–3157. [Google Scholar]
  30. Orrù, P.F.; Zoccheddu, A.; Sassu, L.; Mattia, C.; Cozza, R.; Arena, S. Machine Learning Approach Using MLP and SVM Algorithms for the Fault Prediction of a Centrifugal Pump in the Oil and Gas Industry. Sustainability 2020, 12, 4776. [Google Scholar] [CrossRef]
  31. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting Deep Learning Models for Tabular Data. arXiv 2023, arXiv:2106.11959. [Google Scholar]
  32. Outlier Detection DataSets. Available online: https://odds.cs.stonybrook.edu/ (accessed on 21 January 2024).
  33. Ramezan, C.A.; Warner, T.A.; Maxwell, A.E. Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification. Remote Sens. 2019, 11, 185. [Google Scholar] [CrossRef]
  34. Berrar, D. Cross-Validation. Life Sci. 2019, 1, 542–545. [Google Scholar] [CrossRef]
Figure 1. Overview of the implementation of the AVESYS structure.
Figure 2. AVESYS experimental setup.
Table 1. Results of PyOD unsupervised algorithms on train data (ODDS dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
ECOD | 0.993 | 0.8657 | 0.031
PCA | 0.9903 | 0.9508 | 0.034
AvgKNN | 0.7029 | 0.2186 | 0.13
IForest | 0.9967 | 0.9352 | 0.03
INNE | 0.9895 | 0.832 | 0.03
AutoEncoder | 0.9903 | 0.9508 | 0.03
ALAD | 0.9542 | 0.5579 | 0.06
Table 2. Results of PyOD unsupervised algorithms on test data (ODDS dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
ECOD | 0.9926 | 0.875 | 0.03
PCA | 0.9879 | 0.9506 | 0.035
AvgKNN | 0.7066 | 0.2166 | 0.13
IForest | 0.9952 | 0.9477 | 0.03
INNE | 0.9904 | 0.8399 | 0.03
AutoEncoder | 0.9879 | 0.9506 | 0.03
ALAD | 0.9542 | 0.5727 | 0.06
Table 3. Results of PyOD unsupervised algorithms on train data (real embedded test dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
ECOD | 0.3177 | 0.0093 | 0.20
PCA | 0.3105 | 0.0003 | 0.20
AvgKNN | 0.4666 | 0.0526 | 0.19
IForest | 0.2725 | 0.0282 | 0.19
INNE | 0.3122 | 0.0 | 0.2
AutoEncoder | 0.3111 | 0.0057 | 0.19
ALAD | 0.3126 | 0.0154 | 0.17
Table 4. Results of PyOD unsupervised algorithms on test data (real embedded test dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
ECOD | 0.3805 | 0.0141 | 0.17
PCA | 0.3104 | 0.0003 | 0.17
AvgKNN | 0.3464 | 0.0 | 0.91
IForest | 0.2751 | 0.0224 | 0.19
INNE | 0.6345 | 0.0 | 0.12
AutoEncoder | 0.3068 | 0.0074 | 0.18
ALAD | 0.2765 | 0.0149 | 0.14
Table 5. Results of PyOD supervised algorithms on train data (ODDS dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
CatB+ | 1.0 | 1.0 | 0
LGB | 1.0 | 1.0 | 0
MLP | 0.9955 | 0.9915 | 0.001
SVM | 0.9959 | 0.988 | 0.002
XGB+ | 1.0 | 1.0 | 0
ResNet | 0.9953 | 0.9883 | 0.001
FTTransformer | 1.0 | 0.9982 | 0
Table 6. Results of PyOD supervised algorithms on test data (ODDS dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
CatB+ | 1.0 | 0.9985 | 0
LGB | 1.0 | 0.9984 | 0
MLP | 0.9955 | 0.9855 | 0.002
SVM | 0.9915 | 0.984 | 0.002
XGB+ | 1.0 | 1.0 | 0
ResNet | 0.9959 | 0.9855 | 0.001
FTTransformer | 1.0 | 0.9971 | 0
Table 7. Results of PyOD supervised algorithms on train data (real embedded test dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
CatB+ | 0.9998 | 0.9777 | 0.003
LGB | 0.9997 | 0.9709 | 0.004
MLP | 0.9983 | 0.9256 | 0.04
SVM | 1.0 | 0.9944 | 0
XGB+ | 1.0 | 1.0 | 0
ResNet | 0.5357 | 0.1551 | 0.11
FTTransformer | 0.8935 | 0.3405 | 0.11
Table 8. Results of PyOD supervised algorithms on test data (real embedded test dataset).

Algorithm | Roc_Auc_Score | Precision | MSE
CatB+ | 0.7796 | 0.1314 | 0.06
LGB | 0.7907 | 0.1203 | 0.06
MLP | 1.0 | 0.9917 | 0.02
SVM | 1.0 | 1.0 | 0.01
XGB+ | 0.7602 | 0 | 0.06
ResNet | 1.0 | 0.9977 | 0.04
FTTransformer | 0.9957 | 0.985 | 0.06
Table 9. Results of K-fold cross-validation: Confusion matrices for train data.

Algorithm | Tn | Fp | Fn | Tp
MLP | 1,845,825 | 24,366 | 7021 | 209,708
SVM | 1,869,017 | 1174 | 1222 | 215,507
XGB | 1,870,191 | 0 | 0 | 216,729
LGB | 1,866,442 | 3749 | 13,464 | 203,265
CatB+ | 1,859,396 | 10,795 | 3043 | 213,686
Table 10. Results of K-fold cross-validation: Confusion matrices for test data.

Algorithm | Tn | Fp | Fn | Tp
MLP | 205,152 | 2647 | 758 | 23,323
SVM | 207,668 | 131 | 138 | 23,943
XGB | 207,791 | 7 | 6 | 24,075
LGB | 207,635 | 207,635 | 207,635 | 207,635
CatB+ | 206,366 | 1433 | 594 | 23,487
Table 11. Results of K-fold cross-validation: algorithm’s performance average metrics on train data.

Algorithm | Roc_Auc_Score | Precision | R2_Score | MCC | BA
MLP | 0.9994 | 0.9610 | 0.8745 | 0.9271 | 0.9773
SVM | 1.0000 | 0.9945 | 0.9906 | 0.9938 | 0.9969
XGB+ | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
LGB | 0.9995 | 0.9593 | 0.9409 | 0.9551 | 0.9679
CatB+ | 0.9996 | 0.9668 | 0.9518 | 0.9651 | 0.9901
Table 12. Results of K-fold cross-validation: algorithm’s performance average metrics on test data.

Algorithm | Roc_Auc_Score | Precision | R2_Score | MCC | BA
MLP | 0.9994 | 0.9599 | 0.8769 | 0.9287 | 0.9779
SVM | 1.0000 | 0.9945 | 0.9905 | 0.9938 | 0.9968
XGB+ | 1.0000 | 0.9997 | 0.9993 | 0.9996 | 0.9998
LGB | 0.9993 | 0.9511 | 0.9353 | 0.9499 | 0.9651
CatB+ | 0.9994 | 0.9548 | 0.9392 | 0.9539 | 0.9842
Table 13. Results of K-fold cross-validation: regression performance average metrics on train data.

Algorithm | MSE | MSLE
MLP | 0.0117 | 0.0060
SVM | 0.0009 | 0.0004
XGB+ | 0.0000 | 0.0000
LGB | 0.0055 | 0.0027
CatB+ | 0.0045 | 0.0022
Table 14. Results of K-fold cross-validation: regression performance average metrics on test data.

Algorithm | MSE | MSLE
MLP | 0.0114 | 0.0059
SVM | 0.0009 | 0.0004
XGB+ | 0.0001 | 0.0000
LGB | 0.0060 | 0.0030
CatB+ | 0.0057 | 0.0024
Table 15. Results of cross-validation: algorithm’s performance metrics on train data.

Algorithm | Roc_Auc_Score | Precision | R2_Score | MCC | BA
MLP | 0.9998 | 0.9787 | 0.958 | 0.981 | 0.984
SVM | 1.0000 | 0.9949 | 0.991 | 0.994 | 0.997
XGB+ | 1.0000 | 1.0000 | 0.999 | 1.000 | 1.000
LGB | 0.9996 | 0.9603 | 0.946 | 0.958 | 0.964
CatB+ | 0.9997 | 0.9734 | 0.958 | 0.969 | 0.993
ResNet | 0.474 | 0.0 | −0.12 | −0.07 | 0.499
FTTransformer | 0.6688 | 0.25 | −0.408 | 0.098 | 0.537
Table 16. Results of cross-validation: algorithm’s performance metrics on test data.

Algorithm | Roc_Auc_Score | Precision | R2_Score | MCC | BA
MLP | 0.9997 | 0.9661 | 0.942 | 0.970 | 0.974
SVM | 0.9999 | 0.9949 | 0.991 | 0.980 | 0.982
XGB+ | 0.7693 | 0.2514 | −0.820 | 0.140 | 0.577
LGB | 0.7921 | 0.3668 | −0.329 | 0.203 | 0.589
CatB+ | 0.7300 | 0.2846 | 0.180 | 0.490 | 0.629
ResNet | 0.5000 | 0.0 | −0.09 | 0.0 | 0.5
FTTransformer | 0.5445 | 0.0 | −1.244 | 0.059 | 0.541
Table 17. Results of cross-validation: regression performance metrics on train data.

Algorithm | MSE | MSLE
MLP | 0.004 | 0.002
SVM | 0.0 | 0.0
XGB+ | 0.0 | 0.0
LGB | 0.005 | 0.002
CatB+ | 0.004 | 0.002
ResNet | 0.10 | 0.05
FTTransformer | 0.13 | 0.068
Table 18. Results of cross-validation: regression performance metrics on test data.

Algorithm | MSE | MSLE
MLP | 0.004 | 0.002
SVM | 0.002 | 0.001
XGB+ | 0.138 | 0.067
LGB | 0.10 | 0.048
CatB+ | 0.06 | 0.03
ResNet | 0.08 | 0.039
FTTransformer | 0.17 | 0.092
Table 19. Confusion matrices for train data.

Algorithm | Tn | Fp | Fn | Tp
MLP | 207,750 | 49 | 754 | 23,327
SVM | 207,677 | 122 | 125 | 23,956
XGB | 207,799 | 0 | 0 | 24,081
LGB | 207,736 | 63 | 1747 | 22,334
CatB+ | 206,760 | 1039 | 149 | 23,932
ResNet | 207,693 | 106 | 24,081 | 0
FTTransformer | 197,431 | 10,368 | 21,070 | 3011
Table 20. Confusion matrices for test data.

Algorithm | Tn | Fp | Fn | Tp
MLP | 134,114 | 24 | 635 | 11,405
SVM | 134,121 | 17 | 428 | 11,612
XGB | 122,044 | 12,094 | 9083 | 2957
LGB | 127,877 | 6261 | 9342 | 2698
CatB+ | 134,138 | 0 | 8933 | 3107
ResNet | 134,138 | 0 | 12,040 | 0
FTTransformer | 111,623 | 22,515 | 9030 | 3010