Article
Peer-Review Record

Application of Symbolic Classifiers and Multi-Ensemble Threshold Techniques for Android Malware Detection

Big Data Cogn. Comput. 2025, 9(2), 27; https://doi.org/10.3390/bdcc9020027
by Nikola Anđelić 1,*,†, Sandi Baressi Šegota 1,† and Vedran Mrzljak 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 23 December 2024 / Revised: 17 January 2025 / Accepted: 24 January 2025 / Published: 29 January 2025
(This article belongs to the Special Issue Big Data Analytics with Machine Learning for Cyber Security)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper introduces a novel methodology leveraging Genetic Programming Symbolic Classifiers (GPSCs) and Multi-Ensemble Threshold-Based Voting Ensembles (TBVE) for robust malware detection. 

The key contributions of this paper include the derivation of interpretable symbolic expressions for classification, optimization via RHVS, and the development of multi-TBVE strategies.

The results demonstrate superior performance, achieving up to 98% accuracy, together with an analysis of feature importance and dimensionality reduction.

 

I find that the methodology's rigorous statistical and experimental validation underscores its potential for deployment in real-world scenarios, although its computational demands for parameter tuning can be a limitation.

In my view, this work offers a substantial contribution to the domain of AI-driven cybersecurity, particularly for Android ecosystems.

I would suggest the authors look into the following aspects:

1. Adding insights or experiments addressing the computational overhead and scalability of the existing methodology on larger datasets would enhance the practicality of the approach.

2. A more detailed comparative analysis of GPSC with other interpretable machine learning models, such as decision trees and/or rule-based systems, could strengthen the paper.

3. I suggest including more examples or practical applications where this interpretability offers a tangible advantage in decision-making.

4. The Matthews Correlation Coefficient on real-world malware samples could provide a better metric for assessing the work.

5. Only a few articles are cited. More articles from 2024 should be incorporated. 

 

Author Response

Reviewer 1 

The authors of this manuscript want to thank Reviewer 1 for their time and effort in reading the manuscript and writing comments and suggestions, which greatly improved the manuscript's quality. The authors hope that the manuscript in this form will be accepted for publication.

Reviewer 1's comments are shown in bold, and the authors' answer is provided after each comment.

The paper introduces a novel methodology leveraging Genetic Programming Symbolic Classifiers (GPSCs) and Multi-Ensemble Threshold-Based Voting Ensembles (TBVE) for robust malware detection. 

The key contributions of this paper include the derivation of interpretable symbolic expressions for classification, optimization via RHVS, and the development of multi-TBVE strategies.

The results demonstrate superior performance, achieving up to 98% accuracy, together with an analysis of feature importance and dimensionality reduction.

I find that the methodology's rigorous statistical and experimental validation underscores its potential for deployment in real-world scenarios, although its computational demands for parameter tuning can be a limitation.

In my view, this work offers a substantial contribution to the domain of AI-driven cybersecurity, particularly for Android ecosystems.

 

Answer: Thank you for your thoughtful and encouraging feedback. We greatly appreciate your recognition of the key contributions of our methodology, including the derivation of interpretable symbolic expressions, optimization via RHVS, and the development of multi-TBVE strategies. We are also grateful for your acknowledgment of the rigorous statistical and experimental validation that underscores the real-world potential of our approach. Your observation regarding the computational demands of parameter tuning is well-noted, and we will consider strategies to address this limitation in future work. Your comments are highly motivating and reaffirm the relevance of our research in advancing AI-driven cybersecurity, particularly in the Android ecosystem. Thank you once again for your valuable insights!

 

I would suggest the authors look into the following aspects:

 

  1. Adding insights or experiments addressing the computational overhead and scalability of the existing methodology on larger datasets would enhance the practicality of the approach.

Answer: In the revised version of the manuscript we have reported how much time it took to train the GPSC on all balanced dataset variations, in a new subsection entitled “Computational Resources and time” at the end of the Results section.

Citing from the revised version of the manuscript (the last subsection, “Computational Resources and time”, created at the end of the Results section): “The entire investigation presented in this paper was conducted on a laptop equipped with an AMD Ryzen 5 Mobile 5500U CPU (6 cores, 12 threads) and 16 GB of DDR4-3200 RAM. The training of the GPSC was performed using Python, leveraging key libraries such as scikit-learn, gplearn, matplotlib, and seaborn.

 

Training the GPSC on a single split of 5-fold cross-validation (5FCV) took approximately 10 minutes. Consequently, completing the entire 5FCV process required 50 minutes. Since the study involved training the GPSC on 105 balanced dataset variations, the total theoretical time for the initial training was approximately 5250 minutes (87.5 hours).

 

However, the optimal SEs were not identified during the initial training because the RHVS method was used to fine-tune the GPSC hyperparameters. This iterative process aimed to find the best combination of hyperparameter values to achieve high classification performance. On average, the GPSC was trained three times for each balanced dataset variation, resulting in a total training time of approximately 15,750 minutes (262.5 hours or 10.93 days). This training process was the most time-intensive computation in the investigation.

 

Once the best SEs were obtained and the TBVEs were developed, their evaluation on the initial imbalanced datasets was performed. Evaluating a single TBVE on the 4461 samples took approximately 5 minutes. Since three TBVEs were created, the total evaluation time for all TBVEs was 15 minutes. This step was necessary to compute the evaluation metrics for the entire imbalanced dataset and to determine the threshold value.”
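For illustration, a minimal sketch of how the RHVS search with 5FCV can be set up around gplearn's SymbolicClassifier is given below; the hyperparameter ranges, the accuracy-based selection, and all variable names are illustrative assumptions rather than the exact settings used in the manuscript.

# Minimal sketch: random hyperparameter value search (RHVS) with 5-fold CV for GPSC.
# Hyperparameter ranges and variable names are illustrative assumptions.
import numpy as np
from gplearn.genetic import SymbolicClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def random_hyperparameters(rng):
    # Draw one random combination of GPSC hyperparameters (illustrative ranges).
    return {
        "population_size": int(rng.integers(500, 2000)),
        "generations": int(rng.integers(100, 300)),
        "tournament_size": int(rng.integers(10, 100)),
        "parsimony_coefficient": float(10 ** rng.uniform(-5, -3)),
        "max_samples": float(rng.uniform(0.7, 1.0)),
    }

def evaluate_with_5fcv(X, y, params, seed=42):
    # Each of the 5 folds yields one trained symbolic expression (SE) and its accuracy.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores, expressions = [], []
    for train_idx, test_idx in skf.split(X, y):
        model = SymbolicClassifier(random_state=seed, **params)
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
        expressions.append(str(model._program))   # the evolved symbolic expression
    return float(np.mean(scores)), expressions

# Usage sketch (X_bal, y_bal: one balanced dataset variation as NumPy arrays, placeholder names):
# rng = np.random.default_rng(0)
# results = [evaluate_with_5fcv(X_bal, y_bal, random_hyperparameters(rng)) for _ in range(3)]
# best_score, best_ses = max(results, key=lambda r: r[0])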

 

  2. A more detailed comparative analysis of GPSC with other interpretable machine learning models, such as decision trees and/or rule-based systems, could strengthen the paper.

Answer: The idea of this paper was to investigate whether GPSC could be used to generate symbolic expressions (SEs) with high classification performance. Unlike decision trees or rule-based systems, GPSC produces compact, human-readable equations that are directly optimized for both interpretability and classification performance. Decision trees and rule-based systems, while interpretable, are often prone to overfitting (especially decision trees and Random Forest) and can result in lengthy, complex rules that hinder practical interpretability. Furthermore, the scalability of such models can become a challenge in high-dimensional datasets (this dataset in its original form contains 241 input variables which is a high-dimensional dataset). Therefore, we deliberately excluded these models to maintain the focus of this study on the unique strengths of GPSC. However, we acknowledge that future research could explore comparative analyses to provide further context for the advantages of GPSC. 

  3. I suggest including more examples or practical applications where this interpretability offers a tangible advantage in decision-making.

Answer: Thank you for your valuable suggestion. In the manuscript, we have already included examples of the symbolic equations derived using GPSC and demonstrated their performance in malware detection. These equations provide interpretable insights by explicitly showing the relationships between features and classification outcomes, making it easier for practitioners to understand and act on the results. In the presented symbolic expressions we have explicitly shown which input variables are required to compute the output. For more information, please check the subsection “The analysis of the best SEs” in the “Results” section.

 

We acknowledge the importance of further exploring the tangible advantages of interpretability in comparison to black-box models such as ANN and CNN. While this level of comparison is beyond the scope of the current study, we will include this aspect in future work to provide a more comprehensive analysis of how our interpretable methodology contrasts with traditional black-box approaches. Thank you again for your insightful feedback!




  4. The Matthews Correlation Coefficient on real-world malware samples could provide a better metric for assessing the work.

Answer: In the revised version of the manuscript we have reported the MCC for the TBVEs' classification performance, since the TBVE models are the ones that were tested on the initial imbalanced dataset variations. The MCC is used to evaluate the performance on the imbalanced dataset.

 

We emphasize this point as strongly as possible: the GPSC was not trained on the imbalanced dataset variations. The GPSC with RHVS and 5FCV was trained only on the balanced dataset variations that were obtained using different oversampling techniques. The TBVEs were the only models evaluated on the initial imbalanced dataset variations, since each TBVE consists of the SEs obtained in the previous step by applying the GPSC to the balanced dataset variations. By doing so, the TBVEs are not biased toward a specific class and are generally robust systems, since they consist of a large number of SEs.

So the MCC evaluation metric was used to evaluate the TBVE classification performance on the initial imbalanced dataset. The following results were obtained:

  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with all input variables)  tested on the imbalanced dataset with all input variables: 0.93081
  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with a reduced number of input variables through RFC)  tested on the imbalanced dataset with a reduced number of input variables through RFC: 0.89352
  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with a reduced number of input variables through PCA)  tested on the imbalanced dataset with a reduced number of input variables through PCA: 0.93433
  • The highest MCC value of Multi TBVE with majority voting: 0.925
  • The highest MCC value of Multi TBVE with weighted voting:   0.9397

Additionally, we have commented on the obtained results specifically for the MCC in the Discussion section: “The MCC evaluation metric was employed to assess the classification performance of the TBVE framework on the initial imbalanced dataset. The results demonstrate the effectiveness of different configurations and approaches. When SEs obtained from balanced dataset variations with all input variables were applied, the TBVE achieved the highest MCC value of 0.93081 on the imbalanced dataset. However, when the number of input variables was reduced using RFC and SEs were derived from balanced dataset variations, the MCC value of the TBVE with the aforementioned SEs decreased to 0.89352. On the other hand, the TBVE with SEs obtained on balanced dataset variations with a reduced number of input variables through PCA achieved an improved MCC of 0.93433 on the imbalanced dataset.

 

For Multi-TBVE, the voting strategy significantly influenced performance. Multi-TBVE with majority voting achieved an MCC value of 0.925, while the weighted voting strategy outperformed majority voting, achieving the highest MCC of 0.9379. These results highlight the importance of input variable selection and voting mechanisms in optimizing classification performance, particularly for imbalanced datasets.”
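For completeness, a minimal illustrative sketch of how such MCC values can be computed with scikit-learn is given below; y_true and y_pred are toy placeholders, not the actual dataset labels or TBVE predictions.

# Minimal sketch: computing the Matthews correlation coefficient with scikit-learn.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 1, 0, 1, 1]   # ground truth (1 = malware, 0 = goodware), toy data
y_pred = [1, 1, 0, 0, 1, 0, 1, 1]   # corresponding TBVE predictions, toy data
print(f"MCC = {matthews_corrcoef(y_true, y_pred):.5f}")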

 

  5. Only a few articles are cited. More articles from 2024 should be incorporated.

Answer: In the revised version of the manuscript, the authors have added four papers in which Android malware detection was investigated using artificial intelligence.

Citing the paragraph before Table 1 in the Introduction section: “In \cite{alotaibi2024bioinspired} the authors have integrated swarm intelligence (SI), ANN, and GA to identify both known and developing Android malware attacks. In this investigation the average achieved accuracy was 99.16. The authors in \cite{xiong2024domain} have used ANN, Logistic Regression (LR), Decision Tree Classifier (DTC), K-Nearest Neighbors (KNN), and Random Forest Classifier (RFC) to detect Android malware. The highest accuracy in this investigation was achieved using ANN (90\%). The DenseNet169, Xception, InceptionV3, ResNet50, and VGG16 have been used in \cite{ksibi2024efficient} for Android malware detection. In this paper the Android APK files were converted to binary codes and RGB images for usage as inputs to deep learning models. The results showed that DenseNet169, InceptionV3, and VGG16 achieved classification accuracies of 95.2\%, 95.24\%, and 95.83\%, respectively. The RFC and CNN have been used in \cite{alrabaee2024using} to perform Android malware classification. The RFC outperformed the CNN with a classification accuracy of 92.55\%.”

We hope that these provide sufficient references from 2024.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes an innovative methodology for Android malware detection, utilizing Genetic Programming Symbolic Classifiers (GPSC) and Multi-Ensemble Threshold Techniques. It highlights the use of symbolic expressions for better interpretability, application of various preprocessing and oversampling techniques to handle dataset imbalances, introduction of Threshold-Based Voting Ensembles (TBVE) and a multi-TBVE framework for enhanced classification accuracy, and achieving a classification accuracy of up to 0.98.  The study addresses a crucial and timely topic in Android malware detection.  The paper is well-structured, with a clear methodology and logical flow. I have the following comments:

Major:
1. The dataset used is highly imbalanced (3565 malware vs. 899 goodware samples), which can bias the classifier. Validate the results using additional datasets or cross-datasets to demonstrate generalizability.

2. While the paper employs oversampling techniques, it lacks a thorough analysis of the potential overfitting risks associated with synthetic data generation. Include a discussion on how overfitting is mitigated during training, especially given the use of synthetic samples.

 

3. Feature importance using RFC is limited to a threshold of 0.01 without clear justification. The PCA retains 146 components to explain 100% variance, which may not be optimal for model interpretability or computational efficiency. I would recommend using scree plots or cumulative variance thresholds (e.g., 95%) to determine the optimal number of principal components.

 

4. The paper achieves high accuracy (up to 0.98) but does not address how the model performs on unseen or adversarial data. Also, there is no discussion of noise, concept drift, or dynamic malware behaviors. 

 

5. Majority and weighted voting in Multi-TBVE are introduced but lack empirical comparisons. Conduct ablation studies comparing majority voting and weighted voting to justify the final selection.

 

6. While the paper compares its results to past studies, direct experiments using alternative classifiers (e.g., Random Forest, SVM, CNN) on the same dataset are missing. I would recommend implementing and reporting baseline results using non-GPSC methods to provide a fair comparison.

 

 

Minor: 
1. Line 6-14: The abstract introduces the problem and methodology well, but the performance results are mentioned abruptly. Consider integrating these into a single coherent narrative. 

2.  Line 33-52: The importance of AI in malware detection is well-explained. Adding more recent trends or limitations in current AI methods could provide additional context. 

3. Line 157-198: The explanation of dataset variations is thorough. Consider providing a flowchart summarizing these steps for clarity. 

4. Line 294-316: Feature selection through RFC is well-described. However, justification for the threshold of feature importance (0.01) should be provided.  

5. Figures 12-13: Additional insights into why specific configurations (e.g., scaling techniques) perform better would be valuable. Also, Figure 8 is too detailed for quick interpretation.

6. Table 2: Dataset statistics could benefit from a clearer summary of key findings.  

 

7. In discussion, the implications of using symbolic classifiers compared to deep learning models could be expanded.

 

8. Overall English language is clear, but minor grammatical and stylistic corrections are needed (e.g., "tray" should be "try" in line 288).

 

9. Add comparisons with non-GPSC methods to strengthen claims.

Author Response

Reviewer 2 

The authors of this manuscript want to thank Reviewer 2 for their time and effort in reading the manuscript and writing comments and suggestions, which greatly improved the manuscript's quality. The authors hope that the manuscript in this form will be accepted for publication.

Reviewer 2's comments are shown in bold, and the authors' answer is provided after each comment.



The paper proposes an innovative methodology for Android malware detection, utilizing Genetic Programming Symbolic Classifiers (GPSC) and Multi-Ensemble Threshold Techniques. It highlights the use of symbolic expressions for better interpretability, application of various preprocessing and oversampling techniques to handle dataset imbalances, introduction of Threshold-Based Voting Ensembles (TBVE) and a multi-TBVE framework for enhanced classification accuracy, and achieving a classification accuracy of up to 0.98.  The study addresses a crucial and timely topic in Android malware detection.  The paper is well-structured, with a clear methodology and logical flow. I have the following comments:

 

Major:

  1. The dataset used is highly imbalanced (3565 malware vs. 899 goodware samples), which can bias the classifier. Validate the results using additional datasets or cross-datasets to demonstrate generalizability.

Answer: We appreciate your comment regarding the potential bias caused by the imbalance in the dataset (3565 malware vs. 899 goodware samples). However, this imbalance was carefully addressed throughout our methodology.

  1. Balancing the Dataset:
    • The initial imbalanced dataset was not directly used to train the GPSC. Instead, we applied a combination of scaling/normalization techniques and oversampling methods to create multiple balanced variations of the dataset. This approach ensured that the training process was not affected by class imbalance, and the classifier could focus on learning robust patterns.
  2. Training on Balanced Dataset Variations:
    • Using the balanced dataset variations, we trained the GPSC with 5-fold cross-validation for each variation. This process yielded multiple symbolic expressions (SEs) from each fold. By combining the results from multiple balanced datasets and cross-validation, we generated a large pool of highly accurate SEs, improving both robustness and classification performance.
  3. Testing on the Imbalanced Dataset:
    • After obtaining the SEs, we constructed a threshold-based voting ensemble (TBVE) using these symbolic expressions. The TBVE was then evaluated on the original imbalanced dataset to validate its classification performance.
    • The ensemble’s robustness stems from its composition of a large number of SEs, each trained on balanced datasets. This inherently resists the bias introduced by the imbalance in the original dataset, minimizing the risk of overfitting or underfitting.
  4. Fine-Tuning the Ensemble:
    • By adjusting the threshold value (i.e., the minimum number of consistent predictions by SEs in the ensemble), we further improved classification performance on the imbalanced dataset. This adjustment demonstrated the flexibility and reliability of the TBVE in handling class imbalance.

In summary, by using balanced dataset variations for training and testing the TBVE on the original imbalanced dataset, we ensured that the classifier's performance is robust, resistant to bias, and generalizable. This methodology also mitigates the need for additional datasets, as the ensemble already incorporates a wide variety of SEs generated from diverse balanced datasets.
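For illustration, a minimal sketch of how balanced dataset variations can be generated by combining scaling/normalization with oversampling is given below; the scaler and oversampler choices and the synthetic stand-in data are illustrative assumptions rather than the exact combinations used in the manuscript.

# Minimal sketch: balanced dataset variations from (scaler, oversampler) pairs.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE

# Stand-in for the imbalanced Android dataset (placeholder data, not the real samples).
X, y = make_classification(n_samples=4464, n_features=50, weights=[0.2, 0.8], random_state=42)

scalers = {"minmax": MinMaxScaler(), "standard": StandardScaler()}
oversamplers = {
    "smote": SMOTE(random_state=42),
    "adasyn": ADASYN(random_state=42),
    "borderline_smote": BorderlineSMOTE(random_state=42),
}

# Each (scaler, oversampler) pair yields one balanced dataset variation.
balanced_variations = {}
for s_name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    for o_name, sampler in oversamplers.items():
        X_bal, y_bal = sampler.fit_resample(X_scaled, y)
        balanced_variations[(s_name, o_name)] = (X_bal, y_bal)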


 

To summarize the methodology: we did not use the imbalanced dataset to train the GPSC. We applied different oversampling techniques to a large number of preprocessed datasets and obtained balanced dataset variations. On these datasets the GPSC was trained to obtain the best SEs using RHVS and 5FCV. Since the SEs were obtained on balanced dataset variations, the MCC (Matthews correlation coefficient) metric was not used at that stage. When we developed the TBVE variations, we tested the TBVE performance on the initial imbalanced dataset. Since each TBVE consists of a large number of SEs, the assumption is that these models are robust and not biased toward a specific class, regardless of the imbalance in the initial dataset. That is why in the revised version of the manuscript we have applied the MCC evaluation metric. The following results were obtained:

  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with all input variables)  tested on the imbalanced dataset with all input variables: 0.93081
  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with a reduced number of input variables through RFC)  tested on the imbalanced dataset with a reduced number of input variables through RFC: 0.89352
  • The highest MCC value of TBVE with SEs (SEs obtained on balanced dataset variations with a reduced number of input variables through PCA)  tested on the imbalanced dataset with a reduced number of input variables through PCA: 0.93433
  • The highest MCC value of Multi TBVE with majority voting: 0.925
  • The highest MCC value of Multi TBVE with weighted voting:   0.9397

The description of the MCC evaluation metric was added in the Discussion section. Citing from the revised version of the manuscript (Discussion section): “The MCC evaluation metric was employed to assess the classification performance of the TBVE framework on the initial imbalanced dataset. The results demonstrate the effectiveness of different configurations and approaches. When SEs obtained from balanced dataset variations with all input variables were applied, the TBVE achieved the highest MCC value of 0.93081 on the imbalanced dataset. However, when the number of input variables was reduced using RFC and SEs were derived from balanced dataset variations, the MCC value of the TBVE with the aforementioned SEs decreased to 0.89352. On the other hand, the TBVE with SEs obtained on balanced dataset variations with a reduced number of input variables through PCA achieved an improved MCC of 0.93433 on the imbalanced dataset.

 

For Multi-TBVE, the voting strategy significantly influenced performance. Multi-TBVE with majority voting achieved an MCC value of 0.925, while the weighted voting strategy outperformed majority voting, achieving the highest MCC of 0.9379. These results highlight the importance of input variable selection and voting mechanisms in optimizing classification performance, particularly for imbalanced datasets.”

 

  2. While the paper employs oversampling techniques, it lacks a thorough analysis of the potential overfitting risks associated with synthetic data generation. Include a discussion on how overfitting is mitigated during training, especially given the use of synthetic samples.

Answer: We appreciate the reviewer’s comment regarding the potential risk of overfitting due to synthetic data generation. However, our methodology inherently mitigates such risks, as outlined below:

  1. Mitigation of Overfitting Through Cross-Validation:
    • The GPSC was trained using 5-fold cross-validation on each balanced dataset variation. This approach ensures that the model is validated on unseen data during each fold, significantly reducing the likelihood of overfitting to synthetic samples.
    • For each dataset variation, the 5-fold cross-validation process resulted in five distinct symbolic expressions (SEs). This diversity in SEs across multiple folds further enhances robustness by ensuring that no single dataset or specific synthetic sample overly influences the final model.
  2. Training on Diverse Dataset Variations:
    • The use of multiple balanced dataset variations, created through a combination of scaling, normalization, and oversampling techniques, ensures that the GPSC is exposed to a wide range of data distributions. This diversity minimizes dependency on any specific synthetic sample, reducing the risk of overfitting.
  3. Evaluation on the Original Imbalanced Dataset:
    • To validate the generalizability of the obtained SEs and the threshold-based voting ensemble (TBVE), we evaluated the classification performance on the original imbalanced dataset. This approach ensures that the final ensemble’s performance reflects its ability to generalize to real-world, imbalanced conditions, rather than being influenced solely by the synthetic training data.
  4. Benefits of the Threshold-Based Voting Ensemble (TBVE):
    • The TBVE, constructed from a large number of SEs, inherently resists overfitting due to its ensemble nature. Each SE contributes to the final decision based on a consensus threshold, preventing any single SE from dominating the predictions.
    • Additionally, the threshold value was fine-tuned to further enhance performance, ensuring robustness and preventing over-reliance on specific patterns from synthetic samples.

In summary, overfitting risks associated with synthetic data generation were effectively mitigated through the use of 5-fold cross-validation, diverse balanced dataset variations, evaluation on the original imbalanced dataset, and the ensemble-based nature of the TBVE. These steps collectively ensure that the GPSC and TBVE achieve high generalizability and robust performance.
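For illustration, a minimal sketch of the threshold-based voting idea is given below; the assumption that each SE output is mapped to a class through a sigmoid with a 0.5 cutoff, and all variable names, are illustrative rather than the exact implementation from the manuscript.

# Minimal sketch: a threshold-based voting ensemble (TBVE) over symbolic expressions.
import numpy as np

def tbve_predict(symbolic_expressions, X, threshold):
    # Predict 1 (malware) when at least `threshold` SEs vote for the malware class.
    votes = np.zeros(X.shape[0], dtype=int)
    for se in symbolic_expressions:
        raw = se(X)                             # evaluate one symbolic expression
        prob = 1.0 / (1.0 + np.exp(-raw))       # sigmoid maps the SE output to [0, 1]
        votes += (prob >= 0.5).astype(int)      # one vote per SE
    return (votes >= threshold).astype(int)

# Toy usage with two hand-made "SEs" on a 2-feature placeholder dataset.
ses = [lambda X: X[:, 0] - X[:, 1], lambda X: 2.0 * X[:, 0] - 1.0]
X_demo = np.array([[0.9, 0.1], [0.2, 0.8]])
print(tbve_predict(ses, X_demo, threshold=2))   # -> [1 0]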

 

 

 

  3. Feature importance using RFC is limited to a threshold of 0.01 without clear justification. The PCA retains 146 components to explain 100% variance, which may not be optimal for model interpretability or computational efficiency. I would recommend using scree plots or cumulative variance thresholds (e.g., 95%) to determine the optimal number of principal components.

 

 Answer: We appreciate the reviewer’s insightful feedback regarding the threshold for feature importance in the Random Forest Classifier (RFC) and the use of 100% variance retention in Principal Component Analysis (PCA). Below, we provide clarification and justification for our choices:

  1. Threshold of 0.01 in RFC:
    • The threshold of 0.01 for feature importance in RFC was selected empirically as a reasonable baseline to filter out features with negligible contribution to the model's decision-making process. While we acknowledge that this choice was not explicitly optimized in this study, it proved effective in producing a reduced dataset that maintained high classification performance.
    • The primary goal was to ensure that the selected features retained sufficient discriminatory power while avoiding redundancy. Future work will explore different thresholds (e.g., 0.005, 0.02) and use systematic approaches, such as hyperparameter tuning or permutation importance, to determine the most appropriate threshold for feature selection.
  2. 100% Variance Retention in PCA:
    • Retaining 100% variance in PCA was a deliberate choice to preserve all the information in the original dataset, ensuring no loss of potentially important details. This allowed us to assess the full representational capacity of the dataset when generating symbolic expressions (SEs) using GPSC.
    • While retaining all components may not always be optimal for computational efficiency, it enabled a direct comparison with the RFC-based feature selection approach. This choice also served as a baseline to validate the robustness of the GPSC in handling high-dimensional data.
  3. Future Improvements:
    • We agree with the reviewer’s suggestion to consider scree plots or cumulative variance thresholds (e.g., 95%) to determine the optimal number of principal components in future studies. Reducing the number of components to an interpretable subset while maintaining high variance (e.g., >95%) could further enhance the computational efficiency of the model without compromising performance.
    • Additionally, an ablation study to evaluate the impact of different thresholds for both RFC feature importance and PCA variance retention will be included in future work to identify optimal configurations systematically.
  4. Current Results:
    • Despite the use of these initial thresholds, the results demonstrated that the reduced dataset derived from both RFC (threshold = 0.01) and PCA (100% variance) retained sufficient information to enable GPSC to produce symbolic expressions (SEs) with high classification performance. This indicates that the chosen thresholds did not negatively impact the robustness or accuracy of the models.

In summary, while the threshold values used in this study were based on empirical choices, their effectiveness is validated by the high classification performance achieved. We recognize the need to systematically explore and justify these thresholds in future work and will incorporate the reviewer’s suggestions to refine the methodology further.
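For illustration, a minimal sketch of the suggested cumulative-variance approach (e.g., a 95% threshold) is given below; the synthetic data and scaler are placeholders, and this is not the configuration used in the current study.

# Minimal sketch: choosing the number of PCA components from cumulative explained variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=4464, n_features=241, random_state=42)  # placeholder data
X_scaled = MinMaxScaler().fit_transform(X)

pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)  # smallest k reaching 95% variance
X_reduced = PCA(n_components=n_components_95).fit_transform(X_scaled)
print(f"{n_components_95} components explain at least 95% of the variance")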





  4. The paper achieves high accuracy (up to 0.98) but does not address how the model performs on unseen or adversarial data. Also, there is no discussion of noise, concept drift, or dynamic malware behaviors.

Answer: We appreciate the reviewer’s comment regarding the evaluation of the model on unseen or adversarial data, as well as considerations for noise, concept drift, and dynamic malware behaviors. Below, we address these concerns in the context of our study and its objectives:

  1. Objective of the Study:
    • The primary goal of our study was to investigate whether the Genetic Programming Symbolic Classifier (GPSC) could generate symbolic expressions (SEs) with high classification performance. Our focus was on demonstrating the capability of GPSC to produce interpretable and accurate models, rather than extending the study to adversarial data or dynamic behaviors.
  2. Robustness of the Proposed Methodology:
    • To ensure robustness, we created three distinct dataset variations derived from the original dataset:
      1. The dataset with all 241 input variables.
      2. A reduced dataset with the most important features selected using a random forest classifier (based on feature importance values).
      3. A further reduced dataset created using Principal Component Analysis (PCA).
    • Each of these variations was subjected to scaling/normalization techniques and five different oversampling methods, generating a large number of balanced dataset variations.
    • On each balanced dataset, the GPSC was trained using 5-fold cross-validation, resulting in a substantial number of symbolic expressions (SEs). These SEs were subsequently used to construct threshold-based voting ensembles (TBVEs), and multiple TBVEs were further combined to form a multi-TBVE framework.
  3. Performance on Unseen Data:
    • The final TBVEs and multi-TBVEs were evaluated on the original imbalanced dataset to assess their generalizability and robustness. The use of diverse dataset variations during training ensures that the models are well-equipped to handle unseen data, as they are exposed to a wide range of input scenarios and distributions.
  4. Addressing Noise and Concept Drift:
    • The methodology inherently addresses noise and concept drift by leveraging the diversity of balanced dataset variations and the ensemble-based nature of the TBVE framework. The threshold mechanism within the TBVEs further enhances resilience by requiring consensus among multiple SEs, thereby mitigating the impact of noisy or inconsistent patterns in the data.
  5. Dynamic Malware Behaviors and Future Work:
    • While our study primarily focused on static malware features, the robustness of the multi-TBVE framework suggests its potential applicability to dynamic malware behaviors. Future work could explore extending the approach to incorporate temporal or behavioral features to address evolving malware dynamics.
  6. Constraints on Obtaining New Data:
    • We acknowledge the reviewer’s suggestion to validate the results on additional datasets or cross-datasets. However, obtaining and preprocessing new data, as well as adapting the methodology to these datasets, would require significant time. Given the limited review period available, we were unable to incorporate new datasets into the current study.
    • Instead, we focused on demonstrating that GPSC could reliably generate SEs with high classification performance under a diverse range of scenarios created from the original dataset. This aligns with the primary objective of our study.

In conclusion, while adversarial data and concept drift were not explicitly addressed in this study, our comprehensive methodology ensures robust performance and demonstrates the ability of GPSC to produce interpretable SEs with high classification performance. We recognize the importance of future work to extend the methodology to include evaluations on adversarial datasets, dynamic malware behaviors, and additional data sources.

 

 

 

  5. Majority and weighted voting in Multi-TBVE are introduced but lack empirical comparisons. Conduct ablation studies comparing majority voting and weighted voting to justify the final selection.

 

Answer: We appreciate the reviewer's suggestion to conduct ablation studies comparing majority voting and weighted voting in the Multi-TBVE framework. However, the focus of this study was to investigate the feasibility and effectiveness of using Genetic Programming Symbolic Classifiers (GPSC) to produce symbolic expressions (SEs) with high classification performance, as well as to demonstrate how these SEs can be robustly combined within a threshold-based voting ensemble (TBVE) framework.

The Multi-TBVE was introduced as an innovative extension to further enhance the robustness and classification performance by leveraging multiple TBVEs. This design choice is inherently flexible, and users can adapt the voting mechanism (e.g., majority voting, weighted voting) to suit their specific application requirements. In our case, we prioritized demonstrating the general robustness and practicality of the Multi-TBVE rather than performing an exhaustive analysis of voting mechanisms.

While we acknowledge the value of empirical comparisons, conducting ablation studies for majority voting versus weighted voting is outside the scope of this paper. Instead, our primary contribution lies in showcasing the ability of GPSC and Multi-TBVE to achieve highly accurate and interpretable classification results across diverse dataset variations.

Future work will explore detailed empirical analyses of different voting strategies, including their potential trade-offs in terms of classification performance, computational complexity, and interpretability.

In summary, the current implementation and validation of the Multi-TBVE demonstrate its effectiveness in achieving robust classification performance, and we consider this sufficient to meet the objectives of the study.
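For illustration, a minimal sketch contrasting the two Multi-TBVE fusion rules is given below; the prediction array and the per-TBVE weights are illustrative placeholders.

# Minimal sketch: majority voting vs. weighted voting over multiple TBVEs.
import numpy as np

def majority_vote(tbve_preds):
    # Malware when more than half of the TBVEs vote for it.
    return (tbve_preds.sum(axis=0) > tbve_preds.shape[0] / 2).astype(int)

def weighted_vote(tbve_preds, weights):
    # Each TBVE's vote is scaled by its weight; classify by the weighted average.
    w = np.asarray(weights, dtype=float)
    score = (w[:, None] * tbve_preds).sum(axis=0) / w.sum()
    return (score >= 0.5).astype(int)

tbve_preds = np.array([[1, 0, 1],    # rows: TBVEs, columns: samples (toy 0/1 predictions)
                       [1, 1, 0],
                       [0, 1, 1]])
print(majority_vote(tbve_preds))                    # -> [1 1 1]
print(weighted_vote(tbve_preds, [0.9, 0.7, 0.6]))   # weights, e.g., each TBVE's own accuracy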



  6. While the paper compares its results to past studies, direct experiments using alternative classifiers (e.g., Random Forest, SVM, CNN) on the same dataset are missing. I would recommend implementing and reporting baseline results using non-GPSC methods to provide a fair comparison.

 

 

Answer: We appreciate the reviewer’s suggestion to include baseline experiments using alternative classifiers such as Random Forest, SVM, and CNN for comparison with the GPSC approach. While we understand the importance of benchmarking against a range of classifiers, the primary focus of this paper was to investigate the potential of GPSC in producing symbolic expressions (SEs) with high classification performance, particularly on balanced dataset variations, and to evaluate the robustness of the resulting threshold-based voting ensemble (TBVE) on imbalanced datasets.

Our goal was not to perform a direct comparison with black-box models like CNNs or SVMs, which are known for their complex architecture and long training times. In contrast, GPSC offers the advantage of generating interpretable symbolic expressions, which are computationally less expensive and easier to implement compared to CNNs and SVMs. These symbolic expressions also provide greater transparency and interpretability, which is a key benefit of our approach.

While CNNs and SVMs are widely used for high classification performance, they often require significant computational resources and hyperparameter tuning, which can be time-consuming and less accessible in certain applications. Our study explicitly demonstrates that GPSC can achieve similar, if not better, performance with reduced computational demands. Additionally, the symbolic expressions generated by GPSC can be more easily interpreted and applied in real-world decision-making processes.

We agree that comparing our results with baseline classifiers like SVM and CNN could be valuable, and we plan to explore these comparisons in future work to further validate the practical advantages of GPSC.

We hope that the contribution of this paper, in demonstrating the feasibility and robustness of GPSC for symbolic classification, is clear. We appreciate the reviewer’s input and will consider these suggestions as part of ongoing research.

 

 

 

Minor: 

  1. Line 6-14: The abstract introduces the problem and methodology well, but the performance results are mentioned abruptly. Consider integrating these into a single coherent narrative. 

Answer: In the revised version of the manuscript we have integrated the accuracy value into the narrative so that it is introduced less abruptly. We hope this solves the problem.

Citing the original version of the manuscript: “Android malware detection using artificial intelligence today is a mandatory tool to prevent cyber attacks. To address this problem in this paper the proposed methodology consists of the application of genetic programming symbolic classifier (GPSC) to obtain symbolic expressions (SEs) that can detect if the android is malware or not. To find the optimal combination of GPSC hyperparameter values the random hyperparameter values search method (RHVS) method and the GPSC were trained using 5-fold cross-validation (5FCV). It should be noted that the initial dataset is highly imbalanced (publicly available dataset). This problem was addressed by applying various preprocessing and oversampling techniques thus creating a huge number of balanced dataset variations and on each dataset variation the GPSC was trained. Since the dataset has many input variables three different approaches were considered: the initial investigation with all input variables, input variables with high feature importance, application of principal component analysis. After the SEs with the highest classification performance were obtained they were used in threshold-based voting ensembles and the threshold values were adjusted to improve classification performance. Multi-TBVE has been developed and using them the robust system for Android malware detection was achieved with the highest accuracy of 0.98 was obtained.

Citing the revised version of the manuscript: “Android malware detection using artificial intelligence has become an essential tool to combat cyberattacks effectively. In this study, we propose a methodology based on a genetic programming symbolic classifier (GPSC) to derive symbolic expressions (SEs) capable of distinguishing between malicious and benign Android applications. The highly imbalanced publicly available dataset used for this study was preprocessed using oversampling techniques, generating multiple balanced dataset variations to improve model robustness. To optimize the GPSC, a random hyperparameter value search (RHVS) method was employed, and the model was trained with 5-fold cross-validation (5FCV). Feature reduction techniques, including feature importance analysis and principal component analysis, were also investigated to address the dataset's high dimensionality. The SEs with the highest classification performance were integrated into threshold-based voting ensembles (TBVEs), with threshold adjustments further enhancing classification accuracy. The developed Multi-TBVE system achieved a robust performance, culminating in a highest accuracy of 0.98, demonstrating its effectiveness for Android malware detection.”

 

  2. Line 33-52: The importance of AI in malware detection is well-explained. Adding more recent trends or limitations in current AI methods could provide additional context.

Answer: In the revised version of the manuscript we have created an additional paragraph with recent trends-limitations. 

Citing the added paragraph from the revised version of the manuscript: “One notable trend is the growing adoption of federated learning, a decentralized approach that allows collaborative training of AI models without sharing raw data. This method enhances data privacy while maintaining detection accuracy, making it particularly relevant for cybersecurity applications. Another significant development is the integration of explainable AI (XAI) techniques, such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations), which provide greater transparency and interpretability in decision-making processes, addressing the critical need for trust in AI systems. Additionally, increasing research is focusing on improving the adversarial robustness of AI models, as they remain vulnerable to adversarial attacks that can compromise detection performance. However, despite these advancements, challenges persist, including the need for larger labeled datasets, high computational costs, and generalization issues in detecting novel malware variants. These limitations underscore the importance of continued research and innovation in the field.

  3. Line 157-198: The explanation of dataset variations is thorough. Consider providing a flowchart summarizing these steps for clarity.

Answer: As noted, in this paper three different dataset types were considered (before the application of scaling/normalization and oversampling techniques), i.e., with all input variables, with a reduced number of input variables through RFC feature importance, and with a reduced number of input variables through PCA.

The statistical analysis of the dataset with all input variables consists of:

  • initial statistical analysis (minimum, maximum, mean, standard deviation) for each dataset variable. Initially, this table was included in the manuscript; however, since the table was too long, the editor suggested that the article should be shortened. The only way we could shorten the manuscript was to upload the table to a GitHub repository.
  • the Pearson's correlation analysis was performed; however, due to the large number of input variables (241), the heatmap could not be visualized in the manuscript. So only one diagram was created, in which the correlation between each input variable and the target variable is shown.
  • the outlier detection could not be visualized due to the large number of input variables, so it was not shown in the manuscript.

The statistical analysis of the dataset with a reduced number of input variables through RFC feature importance: for this investigation, the initial statistical analysis was shown in the manuscript, as well as the Pearson's correlation heatmap and the boxplot for outlier detection. This was possible due to the reduced number of dataset variables.

The statistical analysis of the dataset with a reduced number of input variables through PCA: due to the large number of input variables (146), the initial statistical analysis (table) could not be shown in the manuscript. This is also valid for the correlation heatmap and the boxplot used for outlier detection.

 

All three dataset variations have the same class imbalance, and that is shown in Figure 8 (number changed in the revised version of the manuscript).

 

In the revised version of the manuscript we have added the entire flowchart in the subsection entitled “Dataset description and statistical analysis”. Citing from the revised version of the manuscript: “The flowchart of dataset investigation and statistical analysis is shown in Figure \ref{fig:StatAnalysis}.

\begin{figure}[H]

\centering 

\captionsetup{justification=centering}

\begin{adjustwidth}{-3cm}{0cm}

\includegraphics[scale=0.3]{Flowchart_stat_analysis.pdf}

\end{adjustwidth}

\caption{\label{fig:StatAnalysis}The flowchart of initial dataset generation and statistical analysis}

\end{figure}

As seen from Figure \ref{fig:StatAnalysis} from the initial dataset three different dataset variations were created and these are: the dataset with all input variables, the dataset with reduced number of input variables through feature importance, and the dataset with reduced number of input variables through PCA. \newline 

Initially the statistical analysis of all three datasets consisted of: initial statistical analysis (min, max, mean, std) of all dataset variables, Pearson's correlation analysis, and the boxplot for outlier detection. However, this was not possible for several reasons: 

\begin{itemize}

\item in the initial dataset there were too many input variables (241), so the table showing the results of the initial statistical analysis would be too long (more than 15 pages). That is why the table is located in the GitHub repository (check the appendix section for more information); the correlation analysis is conducted only between each input variable and the target variable and is shown in the paper. The outlier detection was not performed due to the large number of dataset variables.

\item the dataset with a reduced number of input variables through RFC feature importance - since there is a small number of input variables, the initial statistical analysis is performed and shown in table form, the Pearson's correlation analysis is performed and shown in the form of a heatmap, as well as the boxplot for outlier detection.

\item the dataset with a reduced number of input variables through PCA - due to the large number of dataset variables (146), the statistical analysis, correlation heatmap, and boxplot (outlier detection) could not be shown in the paper.

\end{itemize}

Since all datasets are derived from the initial dataset all of them have the same number of samples per class. The balance between class samples was investigated on the initial dataset.”




  4. Line 294-316: Feature selection through RFC is well-described. However, justification for the threshold of feature importance (0.01) should be provided.

Answer: In the revised version of the manuscript we have included the description/justification for the threshold value. Citing from the revised version of the manuscript (Dataset reduction through the use of feature importance with Random Forest Classifier): “The threshold of 0.01 for feature importance in the Random Forest Classifier (RFC) was selected based on empirical observations. This value ensures that features contributing minimally to the model's predictive performance are excluded, thereby reducing dimensionality while retaining the most relevant features for classification. A higher threshold may risk excluding potentially informative features, while a lower threshold could result in overfitting due to the inclusion of noise. The choice of 0.01 balances these considerations and effectively optimizes the model's performance.”
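For illustration, a minimal sketch of this feature-selection step is given below; the synthetic data and the RFC settings are placeholders, with only the 0.01 importance threshold taken from the manuscript.

# Minimal sketch: feature selection with an RFC importance threshold of 0.01.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=4464, n_features=241, n_informative=20, random_state=42)

rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
keep = rfc.feature_importances_ >= 0.01    # keep only features above the threshold
X_reduced = X[:, keep]                     # reduced dataset used for subsequent GPSC training
print(f"Kept {keep.sum()} of {X.shape[1]} input variables")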

 

  5. Figures 12-13: Additional insights into why specific configurations (e.g., scaling techniques) perform better would be valuable. Also, Figure 8 is too detailed for quick interpretation.

Answer: Thank you for the comment. The general idea behind applying scaling/normalization and oversampling techniques was to create large dataset diversity and obtain a large number of datasets that could be used to train the GPSC, which would in the end produce a large number of SEs. With the application of the proposed methodology we created enough SEs to obtain a robust and highly accurate TBVE system for the detection of Android malware.

 

Figures 8, 9, and 10 (Figures 9, 10, and 11 in the revised version of the manuscript) were enlarged as much as possible. If the figures are difficult to interpret, we explain them here: Figures 9, 10, and 11 show the number of samples per class of the target variable in the balanced dataset variations. The idea was to achieve an equal number of samples in both classes of the target variable, i.e., Android goodware (label 0) and Android malware (label 1), through the application of different oversampling techniques on the datasets that were already preprocessed with scaling/normalization techniques. In the legend, label_0 represents the Android goodware samples in the dataset, and label_1 represents the Android malware samples.

 

  6. Table 2: Dataset statistics could benefit from a clearer summary of key findings.

Answer: Thank you for the comment, but if you take a closer look at the text below Table 2, we have described the table in as much detail as possible. Citing from the original version of the manuscript: “The results of statistical analysis shown in Table \ref{tab:Init_stat_RFC} showed that all dataset variables have the same number of samples, i.e. 4464. All input variables as well as the target variable have the same minimum and maximum values (0 and 1). For instance, the RECEIVE\_BOOT\_COMPLETED permission has a high mean value of 0.8436, indicating that this permission is frequently used across the dataset, with a standard deviation of 0.3632. Similarly, the WAKE\_LOCK permission exhibits an even higher usage with a mean of 0.8855 and a lower standard deviation of 0.3184, suggesting consistent use across the applications. On the contrary, certain permissions or methods such as Ljava/lang/System;->loadLibrary (mean = 0.1046, std = 0.3061) and Landroid/telephony/TelephonyManager;->getSimOperatorName (mean = 0.0562, std = 0.2304) appear infrequently, with lower mean values and tighter standard deviations. Methods like Ldalvik/system/DexClassLoader;->loadClass (mean = 0.0587, std = 0.2351) and Landroid/content/pm/PackageManager;->getInstalledPackages (mean = 0.0520, std = 0.2220) are used the least, as reflected by their minimal mean values. Permissions related to networking, such as ACCESS\_NETWORK\_STATE and ACCESS\_WIFI\_STATE, exhibit moderate usage with mean values of 0.5963 and 0.3289, respectively, and standard deviations suggesting some variability. Interestingly, the Label column has a high mean value of 0.7986, which might indicate a strong prevalence of a specific classification label associated with these entries.”

 

However, to present the key findings of the statistical analysis more explicitly, we have added the following text to the revised version of the manuscript, immediately after the paragraph quoted above (which itself remains unchanged). Citing the added text: “

Key findings from Table \ref{tab:Init_stat_RFC} are: 

\begin{itemize}

\item Feature selection and representation - the table shows a reduced dataset consisting of 23 input variables selected using RFC feature importance. These features are represented in GPSC as symbolic variables ($Z_0$, $Z_1$,...$Z_{22}$). 

\item Binary features with balanced distributions - Most features have values in the 0 to 1 range, indicating binary or normalized attributes. The mean values indicate varying levels of presence or activation of features. For example: WAKE LOCK has a high mean (0.89), which suggests a frequent occurrence in the dataset. Landroid / telephony / TelephonyManager getSimOperatorName has a low mean (0.05), indicating a rare occurrence.

\item Standard deviations highlight variability - Standard deviation values reveal the spread of data for each feature. Features like RECEIVE BOOT COMPLETED (Z\_0) and GET TASKS (Z\_3) show relatively low variability.

\item Label Distribution - The target variable (\(y\)) has a mean of 0.79, suggesting class imbalance with more samples labeled as 1 (positive class). 

\end{itemize}”



 

 

  7. In the discussion, the implications of using symbolic classifiers compared to deep learning models could be expanded.

Answer: In the revised version of the manuscript (discussion section at the end) we have added the explanation of why genetic programming symbolic classifier is better than Deep learning models. Citing from the revised version of the manuscript: “The use of symbolic classifiers, such as the Genetic Programming Symbolic Classifiers (GPSC) employed in this study, offers several distinct advantages compared to deep learning models. One of the primary benefits of symbolic expressions is their inherent interpretability. Unlike deep learning models, which often operate as black boxes, symbolic classifiers provide explicit mathematical equations that reveal the relationships between input features and outputs. This interpretability is crucial in domains like cybersecurity, where understanding the rationale behind a detection decision is essential for trust, transparency, and compliance with regulatory standards.

Another significant advantage is computational efficiency. Symbolic classifiers are less resource-intensive during both training and inference compared to deep learning models, which often require extensive computational power and specialized hardware (e.g., GPUs). This makes symbolic classifiers more accessible for deployment in resource-constrained environments, such as mobile or embedded systems.

Additionally, symbolic expressions are inherently robust to overfitting, particularly when combined with techniques like dimensionality reduction and regularization. They generalize well to unseen data without requiring the massive amounts of labeled data typically needed for training deep learning models.

While deep learning models excel in handling highly complex, high-dimensional data, their reliance on large datasets and computational resources can pose practical limitations. In contrast, symbolic classifiers offer a lightweight, interpretable, and efficient alternative for many real-world applications, including malware detection, where explainability and resource efficiency are often critical.

 

 

  8. Overall English language is clear, but minor grammatical and stylistic corrections are needed (e.g., "tray" should be "try" in line 288).

Answer: Thank you for the comment. To the best of our knowledge, we have made the necessary grammatical and stylistic corrections in the revised version of the manuscript.

 

 

  9. Add comparisons with non-GPSC methods to strengthen claims.

Answer: Again, in this study we wanted to investigate whether GPSC could produce SEs for Android malware detection and compare the classification accuracy or performance to the results obtained with non-GPSC models. The achieved accuracy confirms that we have succeeded in what we had envisioned. Training SVM, RFC, and CNN models, especially CNN, on 105 balanced dataset variations would take more than a month, including the analysis of the obtained results, due to the fact that a CNN, for example, has a large number of hyperparameters that have to be optimized. The disadvantages of the models used in the other research papers were clearly explained just after Table 1, based on which the novelty and the idea of this paper were defined.

Citing from the original version of the manuscript:  “In Table \ref{tab:other_results} the majority of research papers used the hardly explainable models to predict if the Android application is malware or not. The reported results are excellent however, the major problem is that the majority of these algorithms simply cannot be transformed into mathematical equations that can be easily used. This statement is especially valid for artificial neural networks (ANNs) that cannot be transformed into mathematical equations due to a large number of interconnected neurons.




However, we have included a comparison of non-GPSC models and the GPSC used in this research at the end of the Discussion section; the added paragraphs are quoted in full in our answer to minor comment 7 above.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors addressed my comments. I have no further comments.
