Review

Advanced Financial Fraud Malware Detection Method in the Android Environment

1 School of Cybersecurity, Korea University, Seoul 02841, Republic of Korea
2 Ministry of Information Security, Hana Bank, Seoul 04523, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3905; https://doi.org/10.3390/app15073905
Submission received: 12 February 2025 / Revised: 25 March 2025 / Accepted: 27 March 2025 / Published: 2 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The open-source structure and ease of development of the Android platform are exploited by attackers to create malicious programs, leading to a sharp increase in malicious Android apps aimed at committing financial fraud. This study proposes a machine learning (ML) model based on static analysis to detect such malware. We validated the approach on private datasets collected by Bank A, comprising 183,938,730 benign and 11,986 malicious app samples. Because the benign and malicious app data were imbalanced, undersampling was performed to adjust the proportion of benign applications in the training data. Moreover, 92 datasets were compiled through daily training to evaluate the proposed approach, with benign app data updated over a rolling 70-day window (D-70 to D-1) and malware app data cumulatively aggregated to address the imbalance. Five ML algorithms were evaluated, and the optimal hyperparameter values for each algorithm were obtained using a grid search. We then evaluated the models using common metrics, including accuracy, precision, recall, and F1-Score. The LightGBM model was selected for its superior performance. The optimal decision threshold for determining whether an application was malicious was 0.5. Following re-evaluation, the LightGBM model obtained accuracy and F1-Score values of 99.99% and 97.04%, respectively, highlighting the potential of the proposed model for real-world financial fraud detection.

1. Introduction

Google Play Store offers over 2.85 million applications (apps) [1]. The number of available apps surpassed one million in July 2013 and reached 2.43 million in the fourth quarter of 2023 [2]. These apps provide users with services, such as online shopping, gaming, finance, health, social networking, location tracking, and navigation. According to a report by the International Data Corporation, 326.1 million mobile phones were shipped in the fourth quarter of 2023 [3], and phones based on open-source Android comprise 70.1% of the smartphone market [4].
The openness and extensive adoption of Android make it a crucial target for malicious attackers [5]. A tremendous amount of Android malware has been created and spread and has been used in various illegal activities, including stealing personal information from devices, damaging systems, and phishing. Moreover, unlike the Play Store, untrustworthy app stores allow users to indiscriminately download apps without safety mechanisms, readily exposing them to risk from attackers. The urgency and importance of malware detection have increased, particularly as mobile financial services such as mobile banking and electronic payments have become popular and widespread [6].
New financial fraud malware is often crafted to bypass antiviral software and stealthily infiltrate smartphones. These attacks predominantly focus on Android users, who enjoy considerable freedom in development and application usage. Users who lack awareness of common phishing scams and malware prevention solutions on their phones are the most vulnerable to malware attacks [7]. Financial fraud losses in South Korea, totaling 145.1 billion KRW in 2022, increased by 35.4% to 196.5 billion KRW in 2023, with countless users remaining at risk of financial fraud [8]. Allix et al. [9] identified 22% of the apps on the Google Play Store and 50% of those on App China as malware apps.
Mobile malware is constantly evolving with new features designed to evade detection by malware scanners. Android malware applications typically employ three principal techniques to compromise user devices: repackaging, updating, and downloading [10].
Various malware detection models have been proposed to protect consumers’ personal information and ensure economic security. These studies can be broadly divided into three types: static [11], dynamic [12], and hybrid [13] analyses. Static analysis enables malware detection without executing an application, so mobile devices are never exposed to the malicious code during analysis [14], making it a secure approach to malware detection. Conversely, dynamic analysis requires executing an application within an isolated environment; owing to the limitations of this operational environment, suspicious behaviors can only be detected at runtime. Moreover, as the number of Android apps has increased exponentially, the efficiency of dynamic analysis for identifying malware has decreased. Finally, hybrid analysis combines static and dynamic methods. We propose a static-analysis-based method to identify financial fraud malware. Specifically, static analysis enables the rapid scanning of large numbers of applications by examining code or APK files without executing malicious code. It can also identify malicious behavior patterns through an in-depth analysis of the APK structure, such as permission requests and application programming interface (API) calls, and detect malware from specific code patterns or function calls. Additionally, it offers the benefit of a pre-execution review, ensuring that malware can be detected before it is activated. Finally, static analysis conserves resources because it does not require a real-world execution environment, making it ideal for efficiently scanning many apps. Our study focuses on malware types that are highly associated with financial fraud: billing fraud, stalkerware, hostile downloaders, phishing, and spyware (Figure 1) [15]; a minimal extraction sketch follows the list below.
  • Billing fraud: Malicious apps that exploit payment systems without user consent, often through Android payment methods or unauthorized transactions [16].
  • Stalkerware: Software installed on a user device without knowledge, enabling third parties to track location, monitor communication, and access personal data [17].
  • Hostile downloaders: Code that does not directly cause harm but downloads other unwanted or malicious software onto the device [18].
  • Phishing: These apps disguise themselves as trusted sources, tricking users into submitting personal and billing information, which is then sent to malicious third parties [19].
  • Spyware: Malicious software that monitors the online activity of a user and collects sensitive information, such as login credentials, banking details, and private data without the consent of the user [20].
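To make the static-analysis workflow concrete, the sketch below shows how the indicators discussed above (package name, permissions, services) can be read from an APK without running it. It is a minimal illustration, not the paper's pipeline: it assumes the third-party androguard library (whose import path varies across versions) and a placeholder file name.

```python
# Minimal sketch: reading static features from an APK without executing it.
# Assumes the third-party `androguard` package (import path varies by version)
# and a placeholder APK file name; this is not the paper's actual pipeline.
from androguard.core.bytecodes.apk import APK

def extract_static_features(apk_path: str) -> dict:
    apk = APK(apk_path)
    return {
        "package_name": apk.get_package(),     # e.g., "com.example.app"
        "permissions": apk.get_permissions(),  # permissions from manifest.xml
        "services": apk.get_services(),        # background service components
    }

feats = extract_static_features("sample.apk")
print("READ_SMS requested:",
      "android.permission.READ_SMS" in feats["permissions"])
```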
In this paper, we present a machine learning (ML) model that leverages static analysis to detect financial fraud malware. ML is well suited for this task because it excels at identifying intricate patterns and generalizing from past data, making it ideal for recognizing evolving malware strategies. Static analysis, which involves examining the code and structure of an app without execution, provides an efficient and safe method to extract features such as permissions, API calls, and code patterns. The proposed ML model uses static analysis to rapidly analyze large datasets of apps and detect financial fraud malware before execution, thereby minimizing security risks. This approach combines scalability with the ability to detect new and previously undiscovered malware strains. The main contributions of this study are as follows:
  • Static Analysis for Financial Fraud Malware Detection: This study proposes a novel approach utilizing static analysis for detecting financial fraud malware. Compared to dynamic analysis, static analysis is more efficient for real-time detection in operational environments.
  • Real-World Financial Data Utilization and Feature Engineering: The research leverages real-world financial fraud malware data from Bank A in South Korea. Seventy-one features were extracted from eight feature sets, and the accuracy of distinguishing between benign and malicious apps was enhanced by incorporating unique features such as “User activity”, “User information”, and “App package name statistics”, which differentiate this study from previous research.
  • Extensive ML Model Comparison and Optimization: Multiple machine learning models, including logistic regression, random forest, LightGBM, XGBoost, and CatBoost, were evaluated. Optimized hyperparameters were determined using a grid search to enhance model performance.
  • Scalable and High-Performance Malware Detection for Financial Institutions: The LightGBM model achieved an accuracy of 99.99% and an F1-score of 97.04%, demonstrating its effectiveness as a scalable, real-time cybersecurity solution for financial institutions.
The remainder of this paper is organized as follows: Section 2 provides an overview of related work, Section 3 describes the proposed methodology, Section 4 covers the experiments and results, Section 5 discusses limitations and future directions, and Section 6 concludes the study.

2. Related Work

Various ML techniques have been developed to detect or prevent Android malware. These methods apply static and dynamic analyses, hybrid analysis, and feature extraction.

2.1. Static Analysis-Based Malware Detection

Numerous researchers have explored using static analysis and feature-based methods for detecting Android malware. Table 1 presents a comparison of these studies.
In [21], the authors mined hidden patterns in Android applications using highly sensitive permission-related APIs. An automated malware detection system, MalPat, was introduced and tested on a dataset containing 3185 benign apps and 15,336 malware samples obtained from the Google Play Store [22], VirusShare [23], and Contagio [24]. MalPat achieved a remarkable F1-score of 98.24%.
In [25], an ML approach utilizing logistic regression on permissions and API features was used to detect Android malware. The model was evaluated using two datasets obtained from RPAGaurd [26] and Google Play: a full-feature dataset and a reduced-feature dataset containing 131 features. The model achieved accuracies of 97.25% and 95.87% with full- and reduced-feature datasets, respectively.
In [27], the authors proposed DeepClassifyDroid, a deep learning-based system for Android malware detection. The system employs a three-step process: feature extraction, feature embedding, and detection using a convolutional neural network (CNN) on datasets obtained from Drebin [28] and Chinese app markets. Initially, it performs a comprehensive static analysis and generates five distinct feature sets. Finally, DeepClassifyDroid uses a CNN to detect malware. It outperformed most conventional ML methods, achieving a 97.4% detection rate without false alarms. Furthermore, it was 10 times faster than linear support vector machine (SVM) models and 80 times faster than k-nearest neighbor (KNN) models. In [29], the performance of the KNN, a simple yet effective ML classifier, was evaluated by examining various distance measures and hyperparameters. Specifically, the study involved extensive experiments on the Drebin [28] dataset and comparisons of multiple well-known distance metrics. The findings revealed that selecting appropriate distance measures significantly affected classification accuracy. The authors argued that the commonly used Euclidean distance is not optimal for mobile malware detection, and alternative metrics, such as Hamming and CityBlock distances, can improve the classification performance.
In [30], a malware prevention system using an “end-to-end deep-learning architecture” was proposed. The system detects Android malware and assigns attributes based on the opcodes extracted from the malware. The authors of the study demonstrated that a bidirectional long short-term memory (BiLSTM) neural network outperformed state-of-the-art models in detecting the static behavior of Android malware without using the features developed by other models. Their model demonstrated exceptional performance, achieving an accuracy of 99.90% and an F1-score of 99.60% on a large dataset of more than 1.8 million Android apps obtained from the Android malware dataset (AMD) [31], Drebin [28], and VirusShare [23].
In [32], DANdroid, an Android malware detection model, leveraged a discriminative adversarial network (DAN). The contributions of the study were threefold: (1) the proposed method can effectively differentiate adversarial learning outcomes from malware feature representations; (2) it employs a Multi-view deep learning architecture with three feature sets (raw opcodes, permissions, and API calls) to enhance resistance to obfuscation; and (3) the approach demonstrates the ability to generalize to future obfuscation techniques that were not present during model training. The model attained an average F1-score of 97.3% when evaluated using the Drebin [28] dataset.
In [33], the authors proposed a method that extracts the most significant features of Android apps using static analysis supplemented by two newly introduced features. These features were subsequently processed using a functional API-based deep-learning model. The method was evaluated using a newly curated dataset of 14,079 malware and benign Android applications obtained from VirusTotal [34], AMD [31], MalDozer, and the Contagio Security Blog. Malware samples were organized into four categories. Two experiments were conducted on this dataset. The first focused on binary classification, separating the samples into malware and benign classes. The second performed malware family classification of the samples into five classes. The proposed method achieved F1-scores of 99.5% and 97% in the two- and five-class experiments, respectively.
In [35], an advanced and reliable malware detection system that utilizes deep learning algorithms was developed. The proposed system evaluated recurrent neural network (RNN)-based methods, including LSTM, BiLSTM, and gated recurrent unit (GRU), on the CICInVesSandMal2019 [36] dataset, which comprised 8115 static features for malware detection. The BiLSTM model outperformed the other RNN-based approaches, achieving an accuracy of 98.85% and an F1-score of 98.21%.
In [37], a permission-based malware detection system called PerDRaML was proposed. It determined whether an app was malicious based on suspicious permission usage. The system utilized a multistep methodology to extract and identify important features, including permissions, app size, and permission ratios, from a manually collected dataset of 10,000 apps. Moreover, the authors of the study used various ML models to classify apps as malicious or normal and successfully identified the five most important features to predict malicious apps through extensive experimentation. They achieved high malware detection accuracies of 89.7%, 89.96%, 86.25%, and 89.52% using SVM, random forest, rotation forest, and Naive Bayes approaches, respectively, thereby outperforming conventional techniques.

2.2. Dynamic Analysis-Based Malware Detection

This section reviews several notable studies that employ dynamic analysis for Android malware detection. Table 2 presents a comparison of these studies.
In [38], a model was developed for the dynamic analysis of Android applications to detect malware by monitoring the Android API and system calls. The model was evaluated using datasets from the MalGenome Project [39] and VirusShare [23], comprising 7520 apps. Using various classification algorithms, the model achieved an accuracy of 96.66%. However, it exhibited limitations in detecting certain types of malicious behavior. For instance, it does not gather information on the behavior of apps that operate offline and halts execution in such cases.
In [40], a method that relies solely on application runtime behavior was proposed to classify Android malware into families using SVM and conformal prediction techniques. The approach was evaluated on the Drebin dataset [28] using system calls and binder communication features. It achieved 94% accuracy on the test apps; however, because it tracked only low-level events, it collected limited information and achieved insufficient application coverage. DroidCat, a dynamic classification model for Android apps, was proposed in [41]. The model utilizes various dynamic features, such as intents, method calls, and intercomponent communication. DroidCat was tested on a comprehensive dataset of 34,343 apps collected from MalGenome [39], AndroZoo [42], VirusShare [23], and Drebin [28], achieving an accuracy of approximately 97%. The authors highlighted that some dynamic features, such as the distribution of method calls within the source code and the app execution structure that captures library characteristics, are more influential in the classification process than features such as sensitive flows.
In [43], system calls were extracted from the CIC-ANDMAL2017 dataset [44], and experiments were conducted using different ML algorithms. The KNN and decision tree algorithms reportedly achieved F1-scores of 85% and 72% for malware detection and family classification, respectively. In [45], a parallel ML model was developed that incorporated various classifiers, such as J48, KNN, SVM, and random forest, to detect and classify Android malware using dynamic features. Standard ML classifiers were implemented to detect Android malware categories and families by analyzing a large malware dataset containing 14 major malware categories and 180 well-known malware families from CCCS-CIC-AndMal (2020) [46] in the dynamic layer. The authors conducted experiments using several ML algorithms and compared their proposed model with that of the most recent relevant study. Their model achieved accuracies of 96.89% and 99.65% in detecting Android malware categories and families, respectively.

2.3. Hybrid Analysis-Based Malware Detection

This section reviews several studies that have applied hybrid features, combining static and dynamic characteristics for Android malware detection. Table 3 presents a comparison of the studies discussed herein.
In [47], StormDroid, a streamlined ML-based malware detection model that adopts a hybrid approach, was proposed. It leveraged various features, including permissions, sensitive API calls, sequences, and dynamic behaviors, to classify Android malware. The model was evaluated using multiple ML algorithms, such as SVM, C4.5, multi-layer perceptron (MLP), naïve Bayes (NB), instance-based k (IBK), and Bagging predictors, on a dataset of approximately 8000 apps collected from Google Play [22] and Contagio [24]. StormDroid enables large-scale analysis by monitoring both static and dynamic behaviors and demonstrated an accuracy of 93.8%. Additionally, it achieved approximately three times the efficiency of single-threaded models.
MADAM, a cross-layer model that incorporates static and dynamic features such as system calls, short message service (SMS) messages, critical APIs, user activities, and app metadata, was introduced in [48]. The model was evaluated using various algorithms, including KNN, linear discriminant classification (LDC), quadratic discriminant classifier (QDC), MLP, the Parzen classifier (PARZC), and radial basis function (RBF), on large datasets sourced from MalGenome [39], Contagio [24], and VirusShare [23], achieving an accuracy of 96.6%. Additionally, MADAM is designed to minimize battery consumption and imposes low performance overhead.
In [49], Androtomist, an open-source tool for the static and dynamic analysis of Android applications, was introduced. The tool offers two operational modes: a novice mode for beginners and an expert mode for advanced users. Androtomist was tested on three datasets using various ML classifiers, with an ensemble approach applied to average the results and determine the most significant features. It achieved a perfect accuracy of 100% on the Drebin [28] and VirusShare [23] datasets, whereas its accuracy was slightly lower at 91.8% on the AndroZoo [42] dataset.
Another study [50] proposed a deep learning model for detecting Android malware using a hybrid analysis approach. The model utilized the MalGenome [39] and Drebin [28] datasets for static analysis and the CICMalDroid2020 [51] dataset for dynamic analysis. In total, 261 combined features were extracted for the hybrid analysis. The model was further evaluated using 311 application samples comprising 165 benign apps from the Play Store and 146 malicious apps from VirusShare [23]. The proposed model achieved an accuracy of 99.36%. The authors highlighted that a hybrid analysis approach enhanced the detection rate by approximately 5%.
In [52], a hybrid malware detection method was proposed using tree-augmented Naïve Bayes (TAN) to analyze the conditional dependencies between key static and dynamic features, including API calls, permissions, and system calls required for application functionality. The approach involved training three ridge-regularized logistic regression classifiers to analyze API calls, permissions, and system calls, and employing TAN to model the relationships between these outputs to identify malicious applications. The proposed method achieved sustained detection accuracy of 97% over an extended period.

3. Proposed Malware Detection Approach

The malware detection framework presented in Figure 2 comprises several critical steps. Preprocessing tasks, such as feature extraction, managing missing values, and calculating the undersampling ratio, were performed to mitigate the imbalance between benign and malicious applications [54,55]. Following the daily configuration of the training dataset, the optimal hyperparameters for each ML algorithm were determined using established methods. The algorithms were then evaluated using metrics such as accuracy, precision, recall, and F1-Score, and by plotting precision–recall and receiver operating characteristic (ROC) curves and computing the areas under them (AUPRC and AUROC) [56]. Based on these evaluations, the optimal decision threshold was selected for the best-performing algorithm [57], and the final malware detection model was applied. Furthermore, malware detection includes classification into five categories: Billing Fraud, Stalkerware, Hostile Downloaders, Phishing, and Spyware. However, in this study, experiments were conducted using only two classes: Benign and Malware. The steps are detailed below.

3.1. Data Collection

In this study, we used real-time data collected by Bank A in South Korea from April 2023 to September 2023 (Figure 3). The fraud detection system (FDS) gathered information (Appinfo) on six apps from the Android smartphones of customers who had installed Bank A’s mobile banking app. Specifically, the system collected data on three apps from verified sources (such as the Google Play Store) and three apps from unverified sources (such as third-party stores or APK files) [58]. These apps had been installed within the five days before the customer’s mobile banking login. This distinction between trusted and untrusted sources is critical because apps from unverified sources are more vulnerable to malware [59,60,61]. The collected data were subsequently analyzed to detect malicious apps by applying a blacklist or feature pattern analysis method.
The actual data contained information on 183,938,730 and 11,986 transactions for the benign and malware apps, respectively. For malware app information, an FDS staff member of Bank A spoke with customers over the phone and checked the information on the apps installed on their smartphones to determine whether a malicious app was installed. Table 4 and Table 5 present the details of the raw data collected.

3.2. Feature Extraction

Upon obtaining the data, the features for model training were extracted. We analyzed the raw data and added two categories to the six feature sets listed in Table 6: user information and app package name statistics. The selection of “user information” and “app package name statistics” as additional feature sets was based on their relevance in enhancing the detection of financial fraud malware. “User information”, which includes demographic details such as age group and sex, can provide valuable insights into risk patterns associated with malware infections. Certain demographics may be more vulnerable to specific types of malware or targeted attacks. Additionally, “app package name statistics” help identify suspicious or uncommon app names, often used by malicious apps to masquerade as legitimate ones. These features complement traditional static analysis techniques and improve the accuracy of the model in distinguishing between benign and malicious apps. Consequently, eight feature sets were obtained. Moreover, 57 variables were added to the 14 listed in Table 4 to extract 71 features. Table 5 and Table 6 list the final selected feature sets.
  • Target: The “label” in the raw data to determine if an app is malicious. This was used as the dependent variable to train the model.
  • App information: “App package name” and “app size” were selected from the raw data. Some information in “app size” was missing; therefore, a derived variable called “app size Null indicator” was added. “App name” was excluded from this study because it requires further research on natural language processing as it contains data in various languages, including English, Korean, Chinese, and Japanese.
  • App permissions: These were extracted from the manifest.xml file. Permission information is a key element in determining whether an app is malicious, as evidenced by several related studies on static analysis of malware apps [10,25,27,29,33,35,37]. Of the numerous permissions available, this study utilized “read SMS permission”, “write SMS permission”, “process outgoing calls permission”, “call phone permission”, “request install packages permission”, and “manage external storage permission”.
  • App service: A service is a background component that runs independently of the user interface [62]. The services were also extracted from the manifest.xml file. In this study, only services that could determine the default phone app indicators were used.
  • User activity: Classification of elements changed by user activity. We utilized the “App source” and added “app source verification indicator”, “elapsed time after installation”, and “elapsed time after installation Null indicator” as derived variables. Most malicious applications that steal personal information are installed through illegal URLs rather than through publicly available app stores. Hence, “app source” information is crucial in determining whether an app is malicious.
  • User information: The age range and sex of users who install malicious applications are considered important factors. Hence, customer information from Bank A was combined with the raw data to add “age”, “age Null indicator”, and “sex” as additional variables.
  • App package name statistics: According to the analysis of the app package names in the raw data, 99% of package names contained six or fewer words. The following derived variables were created to learn package name patterns: the length of each of the first six words (package name words are separated by “.”); and, for the entire package name string, the number of words, the standard deviation of word length, the ratio of digits, the ratio of vowels, the ratio of consonants, and the ratios of the maximum consecutive digits, vowels, and consonants (see the sketch after this list).
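The package-name statistics above can be computed with simple string operations. The following Python sketch is illustrative: the feature names and exact definitions are an interpretation of the list above, not the paper's released implementation.

```python
# Sketch: deriving the "app package name statistics" features described above.
# Feature names and exact definitions are illustrative interpretations.
import re
import statistics

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
DIGITS = set("0123456789")

def max_run(s: str, charset) -> int:
    """Length of the longest consecutive run of characters from charset."""
    best = run = 0
    for ch in s:
        run = run + 1 if ch in charset else 0
        best = max(best, run)
    return best

def package_name_features(pkg: str) -> dict:
    words = pkg.lower().split(".")                      # words separated by "."
    chars = re.sub(r"[^a-z0-9]", "", pkg.lower())       # letters and digits only
    n = max(len(chars), 1)                              # avoid division by zero
    feats = {f"word_{i + 1}_len": (len(words[i]) if i < len(words) else 0)
             for i in range(6)}                         # first six word lengths
    feats.update({
        "num_words": len(words),
        "word_len_std": statistics.pstdev(len(w) for w in words),
        "digit_ratio": sum(c in DIGITS for c in chars) / n,
        "vowel_ratio": sum(c in VOWELS for c in chars) / n,
        "consonant_ratio": sum(c in CONSONANTS for c in chars) / n,
        "max_run_digit_ratio": max_run(chars, DIGITS) / n,
        "max_run_vowel_ratio": max_run(chars, VOWELS) / n,
        "max_run_consonant_ratio": max_run(chars, CONSONANTS) / n,
    })
    return feats

print(package_name_features("com.example.banking2fa"))
```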

3.3. Data Preprocessing

In the raw data, the numerical variables “app size”, “elapsed time after installation”, and “age”, and the categorical variables “App permissions” and “App service” contained negative or missing values. Thus, the following preprocessing was performed to handle missing values (a pandas sketch follows the list).
  • App size (app_sz): If a value is missing, it is replaced with 0, and the derived variable “app size Null indicator (app_sz_yn)” is created. If “app_sz” is missing, “app_sz_yn” is set to 1, and if not, it is set to 0.
  • Elapsed time after installation (ist_af_psg_drtm): If a value is negative or missing, it is replaced with 0, and the derived variable “elapsed time after installation Null indicator (ist_af_psg_drtm_yn)” is created. If “ist_af_psg_drtm” is missing, “ist_af_psg_drtm_yn” is set to 1; otherwise, it is set to 0.
  • Age (age): If a value is negative or missing, it is replaced with −1, and the “age Null indicator (age_yn)” is created as an additional variable. If “age” is missing, “age_yn” is set to 1; otherwise, it is set to 0.
  • App permissions: If an app permission (read_sms to manage_external_storage) has a missing value, it is replaced with “?”. Each permission value is then encoded as 1 if it is “Y” or “?”, and 0 if it is “N”.
  • App service: If the app service (basc_phone_app) value is missing, it is replaced with “?”. The value is encoded as 1 if it is “Y” or “?”, and 0 if it is “N”.
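The rules above translate directly into a few pandas operations. The sketch below is a minimal rendering of those rules; the column names follow the paper's abbreviations (Tables 4 and 6), but the intermediate permission column names and the DataFrame itself are placeholders.

```python
# Sketch of the missing-value rules above, written with pandas. Column names
# follow the paper's abbreviations; intermediate permission column names are
# assumptions, and `df` is a placeholder frame.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # App size: missing -> 0, with a Null indicator.
    df["app_sz_yn"] = df["app_sz"].isna().astype(int)
    df["app_sz"] = df["app_sz"].fillna(0)
    # Elapsed time after installation: negative or missing -> 0, with indicator.
    df["ist_af_psg_drtm_yn"] = df["ist_af_psg_drtm"].isna().astype(int)
    df["ist_af_psg_drtm"] = df["ist_af_psg_drtm"].fillna(0).clip(lower=0)
    # Age: negative or missing -> -1, with indicator.
    df["age_yn"] = df["age"].isna().astype(int)
    age = df["age"].fillna(-1)
    df["age"] = age.where(age >= 0, -1)
    # App permissions: "Y" or missing ("?") -> 1, "N" -> 0.
    perm_cols = ["read_sms", "write_sms", "process_outgoing_calls",
                 "call_phone", "request_install_packages",
                 "manage_external_storage"]
    for col in perm_cols:
        df[col] = df[col].fillna("?").map({"Y": 1, "N": 0, "?": 1})
    # App service (default phone app): "Y" or missing ("?") -> 1, "N" -> 0.
    df["basc_phone_app"] = (df["basc_phone_app"].fillna("?")
                            .map({"Y": 1, "N": 0, "?": 1}))
    return df
```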

3.4. Dataset

Using the raw data collected over six months, 92 daily analysis datasets were constructed from 1 April 2023 to 30 September 2023 to evaluate each algorithm. For each analysis dataset, the benign-app training data covered a rolling 70-day window, labeled D-70 to D-1 (Figure 4). The malware-app training data were cumulatively aggregated from April to mitigate the class imbalance. For the test dataset, the evaluation was performed 92 times, once per day from 1 July 2023 to the evaluation date (Day D).
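The windowing scheme can be sketched as follows. This is an illustrative reconstruction of Figure 4's logic: the column names ("event_date", "label") and the tiny synthetic frame are placeholders, not the paper's actual data layout.

```python
# Sketch: assembling one of the 92 daily analysis datasets per Figure 4.
import pandas as pd

df = pd.DataFrame({  # placeholder data so the sketch runs end to end
    "event_date": pd.to_datetime(["2023-04-02", "2023-06-30", "2023-07-01"]),
    "label": [1, 0, 0],  # 1 = malware, 0 = benign
})

def build_daily_dataset(df, eval_date, history_start):
    d1 = eval_date - pd.Timedelta(days=1)
    d70 = eval_date - pd.Timedelta(days=70)
    # Benign training data: rolling 70-day window, D-70 to D-1.
    train_benign = df[(df["label"] == 0) & df["event_date"].between(d70, d1)]
    # Malware training data: cumulative from April to D-1 (imbalance fix).
    train_malware = df[(df["label"] == 1)
                       & df["event_date"].between(history_start, d1)]
    test = df[df["event_date"] == eval_date]  # evaluation day D
    return pd.concat([train_benign, train_malware]), test

# One dataset per evaluation day from 1 July to 30 September 2023 (92 days).
for day in pd.date_range("2023-07-01", "2023-09-30"):
    train, test = build_daily_dataset(df, day, pd.Timestamp("2023-04-01"))
```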
The ratio of benign to malicious app data in the “Train” dataset exceeded 10,000:1, causing a severe class imbalance. Such imbalance can cause performance issues in ML algorithms and distort evaluation results (Table 7). Therefore, several studies have used undersampling and oversampling methods to resolve data imbalances [63]. In this study, undersampling was employed because the proportion of benign app data was much larger. This method was chosen for its advantages in improving computational efficiency, enhancing the focus of the model on the minority class (malicious apps), and maintaining data authenticity. Unlike oversampling or SMOTE [64], undersampling minimizes the risks of overfitting and noise, making it a more effective approach given the characteristics and objectives of the dataset. Hence, we added logic to sample 10% of the “Train(Benign)” app data. The 10% sampling rate was determined experimentally: four test dates were sampled (11, 16, 24, and 28 August 2023), and the sampling rate of the “Train(Benign)” app data was varied between 2% and 30%. Based on the AUPRC, model performance stabilized when 10% or more of the benign app data were sampled.
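Continuing the previous sketch, the 10% undersampling of the benign class reduces to a stratified sample; the random seed is illustrative.

```python
# Sketch: keep all malware rows, sample 10% of benign rows (seed illustrative).
benign_sampled = train[train["label"] == 0].sample(frac=0.10, random_state=42)
malware_all = train[train["label"] == 1]
train_balanced = (pd.concat([benign_sampled, malware_all])
                  .sample(frac=1.0, random_state=42))  # shuffle rows
```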

3.5. Algorithms

In this study, we performed evaluations by selecting major ML algorithms commonly used to evaluate classification models. A grid search method was used to calculate the optimal hyperparameter values to improve the performance of each algorithm (Table 8).
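A grid search of this kind can be sketched with scikit-learn as below. The grids shown are illustrative, not the paper's actual search space (Table 8), and X_train/y_train are placeholders for the preprocessed features and labels.

```python
# Sketch: hyperparameter tuning with a grid search over illustrative grids.
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [4, 8, 12],
}
search = GridSearchCV(LGBMClassifier(class_weight="balanced"),
                      param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X_train, y_train)  # placeholders for preprocessed features/labels
print(search.best_params_, search.best_score_)
```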

3.5.1. Logistic Regression

Logistic regression is a predictive analysis technique used to determine the relationship between two or more variables. It examines whether a binary dependent variable is associated with one or more independent variables, which can be ordinal, nominal, interval, or ratio-level [65]. The adjusted hyperparameters include “max_iter”, “class_weight”, “fit_intercept”, “regularization strength”, and “penalty”.

3.5.2. Random Forest

Random forest is an ensemble learning algorithm that builds multiple decision trees and aggregates their outputs to produce more accurate predictions. When training each decision tree model, the random forest uses the bagging method to train individual decision tree models using a dataset sampled from the entire training dataset, allowing duplicates. The values predicted by these models were averaged to produce the final prediction. The bagging method improves the generalization performance of the prediction model [66]. The adjusted hyperparameters include “n_estimators”, “bootstrap”, “class_weight”, and “max_depth”.

3.5.3. LightGBM

The LightGBM algorithm is built on a Gradient Boosting Decision Tree (GBDT) framework designed to enhance computational efficiency, especially for large-scale data prediction tasks. This high-performance algorithm can quickly process and distribute vast amounts of data. LightGBM achieves faster training and lower memory usage by using a histogram-based approach and a leafwise growth strategy with a maximum depth constraint for trees [67]. The adjusted hyperparameters include “n_estimators”, “learning_rate”, “max_depth”, “class_weight” (set to “balanced”), “max_delta_step”, “num_leaves”, “colsample_bytree”, “subsample”, “objective” (set to “binary”), and “boost_from_average”.
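As a reference point, the sketch below wires up a LightGBM classifier with the hyperparameter names listed above; the specific values are illustrative defaults, not the tuned values from Table 8.

```python
# Sketch: a LightGBM classifier using the hyperparameter names tuned above;
# the values shown are illustrative, not the paper's tuned settings.
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    objective="binary",
    boost_from_average=True,
    class_weight="balanced",
    n_estimators=300,
    learning_rate=0.05,
    max_depth=8,          # leafwise growth with a maximum depth constraint
    num_leaves=63,
    colsample_bytree=0.8,
    subsample=0.8,
    max_delta_step=0.0,   # <= 0 leaves leaf output unconstrained
)
model.fit(X_train, y_train)  # placeholders for the preprocessed training data
```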

3.5.4. XGBoost

XGBoost is a powerful ML algorithm that builds decision trees using gradient boosting. XGBoost is favored over other gradient-boosting machines owing to its fast execution, superior model performance, and efficient use of memory resources. This approach incrementally adds models to correct the errors of previous models. It leverages parallel computation to utilize all available CPUs for tree construction during training. XGBoost enhances computational efficiency and speed by using the “maximum depth” parameter instead of traditional stopping criteria and employs backward tree pruning. Additionally, it incorporates regularization to mitigate overfitting and improve overall performance [68]. The adjusted hyperparameters include “n_estimators”, “learning_rate”, “scale_pos_weight”, and “max_depth”.

3.5.5. CatBoost

CatBoost is an advanced gradient-boosting algorithm introduced by Prokhorenkova et al. [69]. It is highly effective at handling imbalanced datasets and serves as a strong contender for classification algorithms. CatBoost is a specialized form of gradient boosting for decision trees capable of processing categorical and ordered features while mitigating overfitting through a Bayesian estimator. Unlike many other ML models, CatBoost requires minimal training effort and is versatile across various data types and formats. It supports both central processing unit (CPU) and graphics processing unit (GPU) implementations, with the GPU version enabling significantly faster training compared to other leading GBDT implementations, such as XGBoost and LightGBM, for similarly sized ensembles. CatBoost employs an efficient approach to reduce overfitting, thereby enabling the use of an entire dataset for training [70]. The adjusted hyperparameters include “iterations”, “learning_rate”, “subsample”, “scale_pos_weight”, and “depth”.

3.6. Construction of the Proposed Model

Figure 5 illustrates the construction of the proposed model and how the various types of features flow through it.

3.7. Evaluation Metrics

We evaluated the performance of the proposed model using standard metrics, including accuracy, precision, recall, and F1-score. These metrics were computed using the following formulae:
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-Score = (2 × Precision × Recall) / (Precision + Recall).
True positives (TP) represent the number of applications correctly identified as malicious. True negatives (TN) refer to the number of applications correctly identified as benign. False positives (FP) are benign applications incorrectly classified as malicious. False negatives (FN) are malicious applications incorrectly classified as benign.
Accuracy measures the overall performance of the classifier and represents the proportion of correct predictions made by the model. However, this is not a reliable metric for imbalanced datasets because it can produce high values even if the model correctly identifies only a single malware app. Recall evaluates the effectiveness of a classifier in detecting malware applications, whereas precision assesses the reliability of the predictions of a classifier. The F1-score, calculated as the harmonic mean of the recall and precision, provides a balanced evaluation by accounting for FNs and FPs.
The AUROC evaluates the ability of the model to distinguish between classes by measuring separability. It is derived from the ROC curve, a graphical plot of the TP rate against the FP rate at various thresholds.
The precision-recall curve is commonly used to assess classifiers based on their precision and recall. Generally, precision is plotted on the y-axis, and recall is plotted on the x-axis, creating a two-dimensional graph for comparison.
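All of these metrics are available in scikit-learn; the sketch below computes them from placeholder labels y_true and predicted malware probabilities y_prob (average_precision_score is used here as the standard estimate of the area under the precision-recall curve).

```python
# Sketch: computing the metrics above with scikit-learn; `y_true` and
# `y_prob` are placeholders for labels and predicted probabilities.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_pred = (y_prob >= 0.5).astype(int)  # default decision threshold of 0.5
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("AUROC    :", roc_auc_score(y_true, y_prob))
print("AUPRC    :", average_precision_score(y_true, y_prob))
```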

4. Experimental Results and Discussion

This section details the testing and evaluation of various ML algorithms generally used in previous studies on the 92 datasets discussed earlier. The best-performing algorithm was selected based on model evaluation metrics. Subsequently, an optimal decision threshold was chosen. Thereafter, a comparative evaluation was conducted by varying the training cycle and feature values to verify the superiority of the final model. Finally, the importance of each feature used to train the model was examined.

4.1. Model Evaluation

This section evaluates the logistic regression, random forest, LightGBM, XGBoost, and CatBoost models, which are ML classifiers, on the 92 datasets discussed earlier.

4.1.1. Evaluation of Logistic Regression

Table 9 presents the results of the logistic regression classifier evaluation for the 92 datasets. The overall average results for each dataset were as follows: accuracy, 84.9444%; precision, 0.0376%; recall, 92.47%; and F1-Score, 0.0751%. Overall, the accuracy was low. The percentage of malware app data not detected as malicious was 7.53%, whereas the percentage of benign app data not detected as benign was 15.06%. Because nearly all false alarms fell on the vastly larger benign class, the precision was far lower than the recall.
Figure 6 illustrates the confusion matrix for the benign and malware classes. The per-class accuracies ranged from 84.9439% (benign) to 92.4700% (malware); the malware class was predicted more accurately than the benign class.
The ROC curve for the logistic regression method is presented in Figure 7, with an AUROC value of 0.953914. Figure 8 illustrates the precision–recall curve, where the AUPRC was calculated as 0.014239.

4.1.2. Evaluation of Random Forest

Table 10 presents the evaluation results of the random forest classifier for the 92 datasets. The overall average values of the evaluation results for each dataset are as follows: accuracy, 99.9994%; precision, 98.7150%; recall, 90.6815%; and F1-Score, 94.5278%. In general, the accuracy is high. The percentage of malware app data not detected as malicious was 9.32%. Therefore, recall was evaluated to be noticeably lower than precision.
Figure 9 illustrates the confusion matrix for the benign and malware classes. The per-class accuracies ranged from 90.6815% (malware) to 99.9999% (benign); the benign class was predicted more accurately than the malware class.
The AUROC for the random forest method is plotted in Figure 10 and has a value of 0.999461. Figure 11 illustrates the precision–recall curve; the AUPRC was 0.982923.

4.1.3. Evaluation of LightGBM

Table 11 presents the evaluation results of the LightGBM classifier for each dataset. The overall average results for the datasets are as follows: accuracy, 99.9996%; precision, 96.4526%; recall, 97.2635%; and F1-Score, 96.8564%. Accuracy was evaluated to be high, and precision and recall were evaluated to be balanced without significant differences.
The per-class accuracies in the confusion matrix ranged from 97.2635% to 99.9998% (Figure 12); both the benign and malware classes were predicted with high accuracy.
The AUROC for the LightGBM method is plotted in Figure 13 and has a value of 0.999559. Figure 14 illustrates the precision–recall curve, and the AUPRC value was 0.987537.

4.1.4. Evaluation of XGBoost

Table 12 presents the evaluation results of the XGBoost classifier for each dataset. The overall average results for the datasets are as follows: accuracy, 99.9992%; precision, 91.9447%; recall, 95.1350%; and F1-Score, 93.5127%. The accuracy was evaluated as high, and the precision was slightly lower than the recall.
As illustrated in Figure 15, the per-class accuracies in the confusion matrix ranged from 95.1350% to 99.9995%; the benign class was predicted slightly more accurately than the malware class.
The AUROC for the XGBoost method is plotted in Figure 16 and has a value of 0.999145. Figure 17 presents the precision–recall curve, and the AUPRC value was 0.963108.

4.1.5. Evaluation of CatBoost

Table 13 presents the evaluation results of the CatBoost classifier for the 92 datasets. The overall average results for the datasets were as follows: accuracy, 99.9995%; precision, 92.9774%; recall, 98.7480%; and F1-Score, 95.7759%. The accuracy of this method was high. Moreover, although the recall was high, the precision was relatively lower.
The per-class accuracies in the confusion matrix ranged from 98.7480% to 99.9995% (Figure 18); both the benign and malware classes were predicted with high accuracy.
The AUROC for the CatBoost method is plotted in Figure 19 and has a value of 0.999988. Figure 20 illustrates the precision–recall curve, and the value of the AUPRC was 0.977388.

4.2. Selected Model and Decision Threshold

Table 14 presents a comparison of the evaluation results for each model. The main goal of this study is to accurately detect and classify benign and malware apps to ensure the effectiveness of the model in real-world operational environments. Upon evaluating various classifiers, LightGBM was selected as the most suitable model owing to its exceptional performance across key metrics. LightGBM achieved the highest accuracy (0.999996) and best F1-Score (0.968564), demonstrating an excellent balance between precision and recall, which is essential for handling imbalanced datasets. It also recorded the highest AUPRC (0.987537), highlighting its superior ability to differentiate between benign and malicious apps even in challenging scenarios. Compared with other models, such as random forest and CatBoost, LightGBM consistently balanced the precision, recall, and F1-Score more effectively. Logistic Regression, although computationally simple, struggles significantly with imbalanced data, making it impractical for this task. LightGBM was chosen for its outstanding detection performance and computational efficiency, making it ideal for deployment in resource-constrained or real-time environments. Its scalability and robust generalization across various evaluation metrics further solidified its position as the best choice, outperforming traditional models and advanced gradient-boosting models, such as XGBoost and CatBoost. In summary, LightGBM proved to be the most practical and effective solution for this application.
Additional experiments were conducted to further enhance the performance of LightGBM and determine the optimal decision threshold. The experimental results are presented in Table 15. Based on these results, we selected 0.5 as the optimal decision threshold. A fixed decision threshold (0.5) was ultimately selected for its ability to balance precision and recall, its simplicity of implementation, and its suitability for operational environments. Although alternative thresholds, such as 0.603 or dynamic values, offer potential benefits for specific use cases, their added complexity or limited incremental improvements make them less practical for general applications. The final results of our proposed malware detection model exhibited an accuracy of 99.9958%, a precision of 96.6507%, a recall of 97.4277%, and an F1-Score of 97.0376%. The three threshold strategies compared are defined below; a selection sketch follows the list.
  • Threshold 1: Threshold that always changes according to the model and date. For each test date, D, the model was fitted from D-70 to D-15. Subsequently, the threshold that produced a high F1-Score for the benign and malware samples from D-14 to D-1 was selected.
  • Threshold 2: The default threshold value was fixed at 0.5.
  • Threshold 3: The threshold was fixed at a single value of 0.603, which produces high F1-Scores for all 92 datasets measured ex post facto (for reference).
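The sketch below illustrates the three strategies. Threshold 1's per-date F1 search is mimicked with a grid over a validation window; y_val and p_val are placeholder arrays standing in for the D-14 to D-1 labels and scores.

```python
# Sketch: the three threshold strategies above. Threshold 1 searches a grid
# on a validation window (placeholder arrays y_val / p_val); the other two
# are fixed constants.
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that maximizes F1 on a validation window."""
    scores = [f1_score(y_true, y_prob >= t) for t in grid]
    return float(grid[int(np.argmax(scores))])

threshold_1 = best_f1_threshold(y_val, p_val)  # re-fit per test date D
threshold_2 = 0.5    # fixed default; the value selected in this study
threshold_3 = 0.603  # fixed ex post facto value (for reference)
```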

4.3. Comparative Analysis and Statistical Validation of Proposed Model

To validate the performance of our proposed financial fraud malware detection model, we conducted a comprehensive statistical comparison using five feature set variations (Variation1–Variation5) and the proposed model (Table 16). The analysis was performed across four key performance metrics: accuracy, recall, precision, and F1-Score. This section presents the statistical analysis and visualizations using a one-way Analysis of Variance (ANOVA) and box plots to demonstrate the superiority of our proposed model.
To assess the statistical significance of the performance differences across the six feature sets, we applied one-way ANOVA to each performance metric. The null hypothesis (H₀) assumes that the mean performance is the same across all feature sets, while the alternative hypothesis (H₁) assumes that at least one feature set performs significantly better than the others [64]. The ANOVA results were as follows (a computation sketch follows the list):
  • Accuracy: F-Statistic = 56.22, p-value = 3.91 × 10⁻⁴⁷
  • Recall: F-Statistic = 58.19, p-value = 1.60 × 10⁻⁴⁸
  • Precision: F-Statistic = 25.25, p-value = 6.16 × 10⁻²³
  • F1-Score: F-Statistic = 43.92, p-value = 4.42 × 10⁻³⁸
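This test reduces to a single scipy call per metric, as sketched below; the six score arrays (one value per daily dataset for each variation) are placeholders.

```python
# Sketch: one-way ANOVA over the 92 per-dataset scores of each feature-set
# variation, using scipy; the six score arrays are placeholders.
from scipy.stats import f_oneway

f_stat, p_value = f_oneway(scores_var1, scores_var2, scores_var3,
                           scores_var4, scores_var5, scores_proposed)
print(f"F-Statistic = {f_stat:.2f}, p-value = {p_value:.2e}")
```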
Figure 21 and Figure 22 present histograms of F-statistics and p-values for different metrics. As the p-values for all the metrics were significantly smaller than 0.05, we reject the null hypothesis. This confirms that statistically significant differences existed in the performance of the feature sets for all metrics. Table 17 and Figure 23 present the box plots generated to provide a visual representation of the variability and central tendencies of each feature set across the four performance metrics. The box plots illustrate that the proposed model consistently achieves higher median values and less variability than the other variations, indicating a more stable and superior performance across all datasets [71].
The ANOVA results, supported by visual insights from the box plots, demonstrate that the proposed model significantly outperforms the five variations across all performance metrics. The inclusion of additional features, such as user activity, user information, and app package name statistics, in the proposed model led to notable improvements in accuracy, recall, precision, and F1-Score. These enhancements render the proposed model a reliable solution for detecting fraudulent malware in real-world applications.

4.4. Feature Importance

This section discusses the evaluated contributions of each feature to the final model. To calculate the featurewise contributions, we used TreeExplainer (TreeSHAP), which provides variable importance for each prediction case [72]. Figure 24 presents the average contribution of each feature by date for the final LightGBM model. In the “User activity” and “User information” feature sets, age (age), elapsed time after installation (ist_af_psg_drtm), and app source (app_src) were deemed important. Most malware apps are installed through an abnormal method whose source cannot be verified rather than through a normal channel, such as the Play Store. Therefore, app source information is crucial in determining whether an app is malicious.
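A TreeSHAP analysis of this kind can be sketched in a few lines; the snippet below assumes the third-party shap package, the fitted LightGBM model from Section 3, and a placeholder feature frame X_test.

```python
# Sketch: per-case feature contributions with TreeSHAP, as used for Figure 24.
# Assumes the third-party `shap` package, the fitted `model`, and a
# placeholder feature frame `X_test`.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # contribution per feature/case
shap.summary_plot(shap_values, X_test)       # ranks features by mean |SHAP|
```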
In the “App Information” feature set, app size (app_sz) and app package name (app_pkg_nm) are important parameters. As malware apps are created for specific purposes such as stealing personal information, their app size is much smaller than that of benign apps. The raw data revealed that over 90% of the malware apps had a size of 20 MB or less. Moreover, the second and third words among the words in the app package name significantly influenced the inference of the malware apps. In the “App permissions” feature set, the information on request package installation permissions (request_install_packages) had the largest influence compared to information on other permissions.

4.5. Comparison with Related Works

In this section, the categories of datasets, detection methods, ML models, and performance metrics are discussed and compared with existing studies. The proposed framework outperformed existing approaches in several important aspects, including real-world data usage, scalability to real-time applications, and excellent ML performance (Table 18). Combining static analysis, undersampling, and LightGBM is more efficient and effective in detecting financial fraud malware than existing methods, providing a practical solution for large financial institutions.

5. Limitations and Future Directions

The limitations of the proposed work, along with potential future directions, are as follows:
  • The validity of the study may be influenced by dataset bias, as it relies on data from a single bank. Additionally, challenges related to imbalanced data, potential scalability issues in real-world deployment, and the need for continuous adaptation to evolving malware threats must be considered.
  • Since this model is designed for static analysis-based financial fraud malware detection, it has not been evaluated against advanced malware evasion techniques (e.g., code obfuscation, zero-day exploits). This limitation may result in an overestimation of its actual detection effectiveness against advanced persistent threats (APTs).
  • Future research will focus on enhancing the model’s generalizability by incorporating datasets from multiple banks and financial institutions across different countries.
  • Since random undersampling may lead to the loss of critical malicious patterns, potentially affecting detection performance, future work will explore hybrid resampling methods, such as SMOTE, cost-sensitive learning, and ensemble-based strategies, to assess their impact on performance and robustness.
  • Explainable AI (XAI) techniques [73] will be considered to improve the interpretability of ML-based malware detection pipelines. Additionally, incremental learning [74], which is effective in detecting emerging malware, will be explored.
  • Further research will investigate the application of natural language processing techniques on app names to enhance financial fraud malware detection models.

6. Conclusions

This study proposed a new framework for detecting malware apps used to commit financial fraud on Android systems. We utilized the datasets of benign and malware apps analyzed by the FDS of Bank A in South Korea from April 2023 to September 2023. As the data on benign and malware apps were unbalanced, undersampling was performed to adjust the proportion of benign apps in the training dataset. Moreover, 92 datasets were constructed through daily training to select the optimal model. To test the proposed approach, we used five ML algorithms: logistic regression, random forest, LightGBM, XGBoost, and CatBoost. Upon calculating the optimal hyperparameter values for each algorithm using a grid search, the models were evaluated using the following common evaluation metrics: accuracy, precision, recall, F1-score, AUROC, and AUPRC. We selected the LightGBM model because it achieved the best performance, with an accuracy of 99.99% and an F1-Score of 96.86%. Upon selecting 0.5 as the optimal decision threshold to determine whether an app was malicious, the LightGBM model was re-evaluated and yielded an accuracy of 99.99% and an F1-Score of 97.04%. Additionally, the importance of the features in the final model was evaluated. Age, app source, app size, elapsed time after app installation, the package installation request permission, and the statistics of the second and third words of the package name were the most influential variables. The major achievement of this study is a malware detection model usable in real operational environments, built by adding the derived feature sets “User activity”, “User information”, and “App package name statistics” to the “App information”, “App permissions”, and “App service” feature sets commonly used in existing studies on static analysis-based malware detection. The proposed model will be highly effective in detecting financial fraud malware if applied to an FDS that detects abnormal transactions in the financial sector or financial apps.

Author Contributions

Conceptualization, J.S.; methodology, J.S.; validation, D.K.; formal analysis, J.S.; investigation, D.K.; data curation, D.K.; writing—original draft preparation, J.S.; writing—review and editing, J.S.; visualization, D.K.; supervision, K.L.; project administration, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The dataset used in this study cannot be publicly shared due to security policies and confidentiality restrictions imposed by Bank A.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANOVA: Analysis of variance
APK: Android package kit
AUPRC: Area under the precision–recall curve
AUROC: Area under the ROC curve
BiLSTM: Bidirectional long short-term memory
CNN: Convolutional neural network
DAN: Discriminative adversarial network
FDS: Fraud detection system
FN: False negative
FP: False positive
GBDT: Gradient boosting decision tree
GPU: Graphics processing unit
GRU: Gated recurrent unit
IBK: Instance-based k
KNN: k-nearest neighbor
LDC: Linear discriminant classification
LightGBM: Light gradient-boosting machine
LSTM: Long short-term memory
ML: Machine learning
MLP: Multi-layer perceptron
PARZC: Parzen classifier
RBF: Radial basis function
ROC: Receiver operating characteristic
SVM: Support vector machine
TAN: Tree-augmented naïve Bayes
TN: True negative

References

  1. Karunanayake, N.; Rajasegaran, J.; Gunathillake, A.; Seneviratne, S.; Jourjon, G. A multi-modal neural embeddings approach for detecting mobile counterfeit apps: A case study on Google Play Store. IEEE Trans. Mob. Comput. 2022, 21, 16–30.
  2. Statista. Google Play Store: Number of Apps 2024. 2024. Available online: https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/ (accessed on 31 December 2024).
  3. IDC Research. Apple Grabs the Top Spot in the Smartphone Market in 2023 Along with Record High Market Share Despite the Overall Market Dropping 3.2%, According to IDC Tracker. 2024. Available online: https://www.idc.com/getdoc.jsp?containerId=prUS51776424 (accessed on 31 December 2024).
  4. Statista. Mobile OS Market Share Worldwide 2009–2024. 2024. Available online: https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/ (accessed on 31 December 2024).
  5. Arora, A.; Peddoju, S.K.; Conti, M. PermPair: Android malware detection using permission pairs. IEEE Trans. Inf. Forensics Secur. 2020, 15, 1968–1982.
  6. Zhu, H.; Gu, W.; Wang, L.; Xu, Z.; Sheng, V.S. Android malware detection based on multi-head squeeze-and-excitation residual network. Expert Syst. Appl. 2023, 212, 118705.
  7. Financial IT. 4 Banking Malware Types Detected on Users’ Devices in 2023. 2023. Available online: https://financialit.net/news/banking/4-banking-malware-types-detected-users-devices-2023 (accessed on 31 December 2024).
  8. Kyung-don, N. [Graphic News] Damages from Phishing Scams Jump over 35%. Korea Herald. 2024. Available online: https://www.koreaherald.com/article/3361908 (accessed on 31 December 2024).
  9. Allix, K.; Bissyandé, T.F.; Jérome, Q.; Klein, J.; State, R.; Le Traon, Y. Empirical assessment of machine learning-based malware detectors for Android. Empir. Softw. Eng. 2016, 21, 183–211.
  10. Odat, E.; Yaseen, Q.M. A novel machine learning approach for Android malware detection based on the co-existence of features. IEEE Access 2023, 11, 15471–15484.
  11. Talha, K.A.; Alper, D.I.; Aydin, C. APK Auditor: Permission-based Android malware detection system. Digit. Investig. 2015, 13, 1–14.
  12. Enck, W.; Gilbert, P.; Han, S.; Tendulkar, V.; Chun, B.-G.; Cox, L.P.; Jung, J.; McDaniel, P.; Sheth, A.N. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst. 2014, 32, 1–29.
  13. Lindorfer, M.; Neugschwandtner, M.; Weichselbaum, L.; Fratantonio, Y.; van der Veen, V.; Platzer, C. ANDRUBIS—1,000,000 apps later: A view on current Android malware behaviors. In Proceedings of the Third International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Wroclaw, Poland, 11 September 2014; pp. 3–17.
  14. Bayazit, E.C.; Koray Sahingoz, O.; Dogan, B. Malware detection in Android systems with traditional machine learning models: A survey. In Proceedings of the 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 26–28 June 2020; pp. 1–8.
  15. Google Play Protect. Malware Categories. Available online: https://developers.google.com/android/play-protect/phacategories (accessed on 31 December 2024).
  16. Beroual, A.; Al-Shaikhli, I.F. A survey on Android malwares and defense techniques. J. Comput. Theor. Nanosci. 2020, 17, 1557–1565.
  17. EyalSalman, R.T. Android stalkerware detection techniques: A survey study. In Proceedings of the 2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 22–24 May 2023; pp. 270–275.
  18. Play Console Help. Hostile Downloaders. Available online: https://support.google.com/googleplay/android-developer/answer/11189134?hl=en# (accessed on 31 December 2024).
  19. Aonzo, S.; Merlo, A.; Tavella, G.; Fratantonio, Y. Phishing attacks on modern Android. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1788–1801.
  20. Kaspersky. Android Spyware Detection & Removal. Available online: https://www.kaspersky.com/resource-center/preemptive-safety/spyware-on-android (accessed on 31 December 2024).
  21. Tao, G.; Zheng, Z.; Guo, Z.; Lyu, M.R. MalPat: Mining patterns of malicious and benign Android apps via permission-related APIs. IEEE Trans. Reliab. 2018, 67, 355–369.
  22. Google. Google Play Store. Available online: https://play.google.com/store/games (accessed on 31 December 2024).
  23. VirusShare.com. Available online: https://virusshare.com/ (accessed on 31 December 2024).
  24. Parkour, M. Contagio Mini-Dump. Available online: http://contagiominidump.blogspot.com/ (accessed on 31 December 2024).
  25. Tiwari, S.R.; Shukla, R.U. An Android malware detection technique based on optimized permissions and API. In Proceedings of the 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 11–12 July 2018; pp. 258–263.
  26. AndroidPRAGuardDataset. Available online: https://sites.unica.it/pralab/en/AndroidPRAGuardDataset (accessed on 31 December 2024).
  27. Zhang, Y.; Yang, Y.; Wang, X. A novel Android malware detection approach based on convolutional neural network. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, Guiyang, China, 16–19 March 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 144–149. [Google Scholar] [CrossRef]
  28. Arp, D.; Spreitzenbarth, M.; Hübner, M.; Gascon, H.; Rieck, K. Drebin: Effective and explainable detection of Android malware in your pocket. In Proceedings of the 2014 Network and Distributed System Security Symposium; Internet Society, San Diego, CA, USA, 23–26 February 2014; Available online: https://cir.nii.ac.jp/crid/1363670320772385920 (accessed on 31 December 2024).
  29. Baldini, G.; Geneiatakis, D. A performance evaluation on distance measures in KNN for mobile malware detection. In Proceedings of the 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 23–26 April 2019; pp. 193–198. [Google Scholar] [CrossRef]
  30. Amin, M.; Tanveer, T.A.; Tehseen, M.; Khan, M.; Khan, F.A.; Anwar, S. Static malware detection and attribution in Android byte-code through an end-to-end deep system. Future Gener. Comput. Syst. 2020, 102, 112–126. [Google Scholar] [CrossRef]
  31. Rijin, F. Malware Dataset [Dataset]. 2017. Available online: https://www.kaggle.com/datasets/blackarcher/malware-dataset (accessed on 31 December 2024).
  32. Millar, S.; McLaughlin, N.; Martinez del Rincon, J.; Miller, P. Multi-view deep learning for zero-day Android malware detection. J. Inf. Secur. Appl. 2021, 58, 102718. [Google Scholar] [CrossRef]
  33. İbrahim, M.; Issa, B.; Jasser, M.B. A method for automatic Android malware detection based on static analysis and deep learning. IEEE Access 2022, 10, 117334–117352. [Google Scholar] [CrossRef]
  34. VirusTotal. Available online: https://www.virustotal.com/gui/home/upload (accessed on 31 December 2024).
  35. Bayazit, E.C.; Sahingoz, O.K.; Dogan, B. A deep learning based Android malware detection system with static analysis. In Proceedings of the 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 9–11 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
  36. Investigation on Android Malware [Dataset], Datasets|Research|Canadian Institute for Cybersecurity|UNB. 2019. Available online: https://www.kaggle.com/datasets/malikbaqi12/cic-invesandmal2019-dataset (accessed on 31 December 2024).
  37. Akbar, F.; Hussain, M.; Mumtaz, R.; Riaz, Q.; Wahab, A.W.A.; Jung, K.-H. Permissions-based detection of Android malware using machine learning. Symmetry 2022, 14, 718. [Google Scholar] [CrossRef]
  38. Afonso, V.M.; de Amorim, M.F.; Grégio, A.R.A.; Junquera, G.B.; de Geus, P.L. Identifying Android malware using dynamically obtained features. J. Comput. Virol. Hacking Tech. 2015, 11, 9–17. [Google Scholar] [CrossRef]
  39. MalGenome Project, Yajins Homepage. Available online: http://www.malgenomeproject.org/ (accessed on 31 December 2024).
  40. Dash, S.K.; Suarez-Tangil, G.; Khan, S.; Tam, K.; Ahmadi, M.; Kinder, J.; Cavallaro, L. DroidScribe: Classifying Android malware based on runtime behavior. In Proceedings of the 2016 IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 22–26 May 2016; pp. 252–261. [Google Scholar] [CrossRef]
  41. Cai, H.; Meng, N.; Ryder, N.; Yao, D. DroidCat: Effective Android malware detection and categorization via app-level profiling. IEEE Trans. Inf. Forensics Secur. 2019, 14, 1455–1470. [Google Scholar] [CrossRef]
  42. Allix, K.; Bissyandé, T.F.; Klein, J.; Traon, Y.L. AndroZoo: Collecting millions of Android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, Austin, TX, USA, 14–22 May 2016; ACM: New York, NY, USA, 2016; pp. 468–471. [Google Scholar] [CrossRef]
  43. Shakya, S.; Dave, M. Analysis, detection, and classification of Android malware using system calls. arXiv 2022, arXiv:2208.06130. [Google Scholar] [CrossRef]
  44. Lashkari, A.H.; Kadir, A.F.; Taheri, L.; Ghorbani, A.A. Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification. In Proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018. [Google Scholar]
  45. Hashem El Fiky, A.; Madkour, M.A.; El Shenawy, A. Android malware category and family identification using parallel machine learning. J. Inf. Technol. Manag. 2022, 14, 19–39. [Google Scholar] [CrossRef]
  46. Rahali, A.; Lashkari, A.H.; Kaur, G.; Taheri, L.; Gagnon, F.; Massicotte, F. DIDroid: Android Malware Classification and Characterization Using Deep Image Learning. In Proceedings of the 10th International Conference on Communication and Network Security (ICCNS2020), Tokyo, Japan, 27–29 November 2020; pp. 70–82. [Google Scholar]
  47. Chen, S.; Xue, M.; Tang, Z.; Xu, L.; Zhu, H. StormDroid: A streaminglized machine learning-based system for detecting Android malware. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 30 May–3 June 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 377–388. [Google Scholar] [CrossRef]
  48. Saracino, A.; Sgandurra, D.; Dini, G.; Martinelli, F. MADAM: Effective and efficient behavior-based Android malware detection and prevention. IEEE Trans. Dependable Secure Comput. 2018, 15, 83–97. [Google Scholar] [CrossRef]
  49. Kouliaridis, V.; Kambourakis, G.; Geneiatakis, D.; Potha, N. Two anatomists are better than one—Dual-level Android malware detection. Symmetry 2020, 12, 1128. [Google Scholar] [CrossRef]
  50. Hadiprakoso, R.B.; Kabetta, H.; Buana, I.K.S. Hybrid-based malware analysis for effective and efficiency Android malware detection. In Proceedings of the 2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 19–20 November 2020; pp. 8–12. [Google Scholar] [CrossRef]
  51. Mahdavifar, S.; Kadir, A.F.A.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning. In Proceedings of the 18th IEEE International Conference on Dependable, Autonomic, and Secure Computing (DASC), Calgary, AB, Canada, 17–24 August 2020. [Google Scholar]
  52. Surendran, R.; Thomas, T.; Emmanuel, S. A TAN based hybrid model for Android malware detection. J. Inf. Secur. Appl. 2020, 54, 102483. [Google Scholar] [CrossRef]
  53. sk3ptre. sk3ptre/AndroidMalware_2019. 2024. Available online: https://github.com/sk3ptre/AndroidMalware_2019 (accessed on 31 December 2024).
  54. Yuan, Z.; Lu, Y.; Xue, Y. Droiddetector: Android malware characterization and detection using deep learning. Tsinghua Sci. Technol. 2016, 21, 114–123. [Google Scholar] [CrossRef]
  55. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  56. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  57. Provost, F.J.; Fawcett, T.; Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998; pp. 445–453. [Google Scholar]
  58. Google Cloud Platform Console Help, Unverified Apps. Available online: https://support.google.com/cloud/answer/7454865?hl=en (accessed on 31 December 2024).
  59. Zhou, Y.; Jiang, X. Dissecting Android malware: Characterization and evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 95–109. [Google Scholar] [CrossRef]
  60. Felt, A.P.; Greenwood, K.; Wagner, D. The effectiveness of application permissions. In Proceedings of the 2nd USENIX Conference on Web Application Development, Portland, OR, USA, 15–16 June 2011; Available online: https://www.usenix.org/events/webapps11/tech/final_files/Felt.pdf (accessed on 31 December 2024).
  61. Shabtai, A.; Fledel, Y.; Elovici, Y. Securing Android-powered mobile devices using SELinux. IEEE Secur. Priv. 2010, 8, 36–44. [Google Scholar] [CrossRef]
  62. Android Developers, Services Overview|Background Work. Available online: https://developer.android.com/develop/background-work/services (accessed on 31 December 2024).
  63. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 243–248. [Google Scholar] [CrossRef]
  64. Vijayvargiya, S.; Kumar, L.; Murthy, L.; Misra, S.; Krishna, A.; Padmanabhuni, S. Empirical analysis for investigating the effect of machine learning techniques on malware prediction. In Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering ENASE, Lisbon, Portugal, 24–25 April 2023; SciTePress: Prague, Czech Republic, 2023; pp. 453–460. [Google Scholar] [CrossRef]
  65. Statistics Solutions, What Is Logistic Regression? Available online: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/ (accessed on 31 December 2024).
  66. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  67. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Advances in Neural Information Processing, Systems. Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf (accessed on 30 December 2017).
  68. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  69. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  70. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar] [CrossRef]
  71. Kumar, L.; Hota, C.; Mahindru, A.; Neti, L.B.M. Android malware prediction using extreme learning machine with different kernel functions. In Proceedings of the 15th Asian Internet Engineering Conference, Phuket, Thailand, 7–9 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 33–40. [Google Scholar] [CrossRef]
  72. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. Explainable AI for trees: From local explanations to global understanding. arXiv 2019, arXiv:1905.04610. [Google Scholar] [CrossRef]
  73. Nascita, A.; Aceto, G.; Ciuonzo, D.; Montieri, A.; Persico, V.; Pescapé, A. A survey on explainable artificial intelligence for Internet traffic classification and prediction, and intrusion detection. IEEE Commun. Surv. Tutor. 2024. [Google Scholar] [CrossRef]
  74. Xu, X.; Zhang, X.; Zhang, Q.; Wang, Y.; Adebisi, B.; Ohtsuki, T.; Sari, H.; Gui, G. Advancing malware detection in network traffic with self-paced class incremental learning. IEEE Internet Things J. 2024, 11, 21816–21826. [Google Scholar] [CrossRef]
Figure 1. Malware categories.
Figure 2. Suggested framework for detecting malware.
Figure 3. Raw data collection process.
Figure 4. Configuring experimental datasets.
Figure 5. Construction of the proposed model.
Figure 6. Confusion matrix of the Logistic Regression classifier with two target classes: Malware and Benign.
Figure 7. ROC–AUC curve obtained using Logistic Regression.
Figure 8. Precision–Recall curve obtained using Logistic Regression.
Figure 9. Confusion matrix of the Random Forest classifier with two target classes.
Figure 10. ROC–AUC curve obtained using Random Forest.
Figure 11. Precision–Recall curve obtained using Random Forest.
Figure 12. Confusion matrix of the LightGBM classifier with two target classes.
Figure 13. ROC–AUC curve obtained using LightGBM.
Figure 14. Precision–Recall curve obtained using LightGBM.
Figure 15. Confusion matrix of the XGBoost classifier with two target classes.
Figure 16. ROC–AUC curve obtained using XGBoost.
Figure 17. Precision–Recall curve obtained using XGBoost.
Figure 18. Confusion matrix of the CatBoost classifier with two target classes.
Figure 19. ROC–AUC curve obtained using CatBoost.
Figure 20. Histogram of F-statistics for different metrics.
Figure 21. Histogram of F-statistics for different metrics.
Figure 22. Histogram of p-values for different metrics.
Figure 23. Boxplots of evaluation metrics for feature selection techniques.
Figure 24. Datewise average feature importance for the proposed model.
Table 1. Comparison of some related studies that utilized static analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Tao et al. [21] | 2017 | API calls | Google Play [22], VirusShare [23], and Contagio [24] | DT | F1-score: 98.24% |
| Tiwari et al. [25] | 2018 | Permissions, API calls | PRAGuard [26] and Google Play | Logistic regression | Accuracy: 97.25%; Accuracy: 95.87% |
| Zhang et al. [27] | 2018 | Permissions, Intent filters, API calls, Constant strings | Drebin [28] and Chinese app markets | CNN | Accuracy: 97.40% |
| Baldini et al. [29] | 2019 | Permissions, API calls, Components, Network addresses | Drebin [28] | KNN | Accuracy: 99.48% |
| Amin et al. [30] | 2020 | Opcodes | AMD [31], Drebin [28], and VirusShare [23] | BiLSTM, LSTM, CNN, and DBN | Accuracy: 99.90%; F1-score: 99.60% |
| Millar et al. [32] | 2021 | Raw opcodes, Permissions, and API calls | Drebin [28] | DAN | F1-score: 97.30% |
| İbrahim et al. [33] | 2022 | Permissions, Services, API calls, Broadcast receivers, File size, Fuzzy hash, Opcode sequence | VirusTotal [34], AMD [31], MalDozer, and Contagio Security Blog | Functional API-based deep learning | F1-score: 99.5%; F1-score: 97% |
| Bayazit et al. [35] | 2022 | Permissions, Intents | CICInvesAndMal2019 [36] | RNN-based LSTM, BiLSTM, and GRU | Accuracy: 98.85%; F1-score: 98.21% |
| Akbar et al. [37] | 2022 | Permissions | VirusShare [23] | Random forest, SVM, rotation forest, and naïve Bayes | Accuracy: greater than or equal to 89% |
| Proposed method | | App information, App permissions, App service, User activity, User information, App package name statistics | Private dataset | LightGBM | Accuracy: 99.99%; F1-score: 97.04% |
Table 2. Comparison of related studies that utilized dynamic analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Afonso et al. [38] | 2015 | API calls and System calls | MalGenome [39] and VirusShare [23] | RF, J.48, Simple Logistic, NB, SMO, BayesNet, and IBK | Accuracy: 96.66% |
| Dash et al. [40] | 2016 | Binder communication, System calls | Drebin [28] | SVM | Accuracy: 94.00% |
| Cai et al. [41] | 2019 | Method calls, Inter-component communication (ICC), Intents | Drebin [28], MalGenome [39], VirusShare [23], and AndroZoo [42] | DroidCat | Accuracy: 97.00% |
| Shakya et al. [43] | 2022 | System calls | CIC-ANDMAL2017 [44] | KNN and Decision tree | F1-score: 85%; F1-score: 72% |
| El Fiky et al. [45] | 2022 | Memory features, APIs, Network, Battery, Logcat, Processes | CCCS-CIC-AndMal (2020) [46] | J48, KNN, SVM, and random forest | Accuracy: 96.89%; Accuracy: 99.65% |
Table 3. Comparison of related studies that utilized hybrid analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Chen et al. [47] | 2016 | Permissions, Sensitive API calls, Dynamic behavior sequences | Google Play [22] and Contagio [24] | StormDroid (SVM, MLP, C4.5, IBK, NB, and Bagging predictor) | Accuracy: 93.80% |
| Saracino et al. [48] | 2018 | System calls, SMS, Critical APIs, User activity, App metadata | MalGenome [39], Contagio [24], and VirusShare [23] | K-NN, QDC, LDC, PARZC, MLP, and RBF | Accuracy: 96.90% |
| Kouliaridis et al. [49] | 2020 | API calls, Permissions, Intents, Network traffic, Java classes, Inter-process communication | Drebin [28], VirusShare [23], and AndroZoo [42] | LR, naïve Bayes, RF, KNN, SGC, AdaBoost, SVM, and Ensemble | Drebin: 100%; VirusShare: 100%; AndroZoo: 91.8% |
| Hadiprakoso et al. [50] | 2020 | | MalGenome [39], Drebin [28], and CICMalDroid 2020 [51] | Gradient boost (GB) | Accuracy: 99.36% |
| Surendran et al. [52] | 2020 | API calls, Permissions, System calls | Drebin [28], AMD [31], AndroZoo [42], GitHub [53], and Google Play | Tree-augmented naïve Bayes (TAN) | Accuracy: 97.00% |
Table 4. Variable information of the collected raw data.

| Category | Variable Name | Description | Other Information |
|---|---|---|---|
| Target | lbl_cd | Label | "Benign" or "Malware" |
| Date | trsc_dt | Transaction date | Login date |
| App information | app_nm | App name | |
| | app_pkg_nm | App package name | |
| | app_sz | App size | |
| | app_ist_dt | App installation date | |
| App permissions | read_sms | SMS read permission | "Y" or "N" |
| | receive_sms | SMS receive permission | "Y" or "N" |
| | process_outgoing_calls | Permission to process outgoing calls | "Y" or "N" |
| | call_phone | Permission to make calls | "Y" or "N" |
| | request_install_packages | Permission to request the installation of packages | "Y" or "N" |
| | manage_external_storage | Permission to manage shared storage | "Y" or "N" |
| App service | basc_phone_app | Whether it is the default phone app | "Y" or "N" |
| User activity | app_src | App source | "Source Verification" or "Source Unverified" |
Table 5. Information on selected feature sets—Part 1.

| Feature Set | Feature Name | Description | Type |
|---|---|---|---|
| Target | lbl_cd | Label | Categorical |
| Date | trsc_dt | Transaction date | Date |
| App information | app_pkg_nm | App package name | Categorical |
| | app_sz | App size | Numerical |
| | app_sz_yn | Whether the app size is Null | Categorical |
| App permissions | read_sms | SMS read permission | Categorical |
| | receive_sms | SMS receive permission | Categorical |
| | process_outgoing_calls | Permission to process outgoing calls | Categorical |
| | call_phone | Permission to make calls | Categorical |
| | request_install_packages | Permission to request the installation of packages | Categorical |
| | manage_external_storage | Permission to manage shared storage | Categorical |
| App service | basc_phone_app | Whether it is the default phone app | Categorical |
| User activity | app_src | App source | Categorical |
| | app_src_cnfm_yn | Whether the app source is verified | Categorical |
| | Ist_af_psg_drtm | The elapsed time after the installation (installation date to transaction date) | Numerical |
| | Ist_af_pst_drtm_yn | Whether the elapsed time after the installation is Null | Categorical |
| User information | age | Age | Numerical |
| | age_yn | Whether the age is Null | Categorical |
| | gndr_dv_cd | Gender | Categorical |
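As a small illustration of how the Table 4 and Table 5 variables could be turned into model inputs, the following sketch derives a null-indicator feature and binarizes the "Y"/"N" flags with pandas. The frame contents are invented examples; the study's actual preprocessing pipeline is not published.

```python
import pandas as pd

# Invented example rows; real data come from Bank A's FDS logs.
raw = pd.DataFrame({
    "app_sz": [10_485_760, None],
    "read_sms": ["Y", "N"],
    "request_install_packages": ["N", "Y"],
})

features = raw.copy()
# Derive the null indicator mirroring app_sz_yn in Table 5.
features["app_sz_yn"] = raw["app_sz"].isna().map({True: "Y", False: "N"})
# Binarize the "Y"/"N" categorical flags for the classifiers.
for col in ["read_sms", "request_install_packages", "app_sz_yn"]:
    features[col] = features[col].map({"Y": 1, "N": 0})
print(features)
```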
Table 6. Information on selected feature sets—Part 2.

| Feature Set | Feature Name | Description | Type |
|---|---|---|---|
| App package name statistics | nm_len | Length | Numerical |
| | word_len | Number of words | Numerical |
| | word_list_len_std | The standard deviation of the length of individual words in the word list | Numerical |
| | word_list_len_mean | The mean of the length of individual words in the word list | Numerical |
| | num_ratio | The ratio of numbers | Numerical |
| | vowel_ratio | The ratio of vowels | Numerical |
| | consonant_ratio | The ratio of consonants | Numerical |
| | consecutive_num_ratio | MAX (the ratio of consecutive numbers) | Numerical |
| | consecutive_vowel_ratio | MAX (the ratio of consecutive vowels) | Numerical |
| | consecutive_consonant_ratio | MAX (the ratio of consecutive consonants) | Numerical |
| | word_1_len | The length of the first word | Numerical |
| | word_1_num_ratio | The ratio of numbers in the first word | Numerical |
| | word_1_vowel_ratio | The ratio of vowels in the first word | Numerical |
| | word_1_consonant_ratio | The ratio of consonants in the first word | Numerical |
| | word_1_consecutive_num_ratio | MAX (the ratio of consecutive numbers in the first word) | Numerical |
| | word_1_consecutive_vowel_ratio | MAX (the ratio of consecutive vowels in the first word) | Numerical |
| | word_1_consecutive_consonant_ratio | MAX (the ratio of consecutive consonants in the first word) | Numerical |
| | word_6_consecutive_num_ratio | MAX (the ratio of consecutive numbers in the sixth word) | Numerical |
| | word_6_consecutive_vowel_ratio | MAX (the ratio of consecutive vowels in the sixth word) | Numerical |
| | word_6_consecutive_consonant_ratio | MAX (the ratio of consecutive consonants in the sixth word) | Numerical |
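The package-name statistics above are plain string measurements, so they are straightforward to reproduce. The sketch below re-implements a handful of them under assumed definitions (words split on "."; ratios taken over the non-separator characters); the authors' exact formulas are not given, so treat this as illustrative.

```python
import re

def package_name_stats(pkg: str) -> dict:
    """Compute a few Table 6-style statistics (assumed definitions)."""
    words = pkg.split(".")                # word list of the package name
    lengths = [len(w) for w in words]
    chars = pkg.replace(".", "")          # characters without separators
    digits = sum(c.isdigit() for c in chars)
    vowels = sum(c in "aeiou" for c in chars.lower())

    def max_run_ratio(pattern: str) -> float:
        # MAX(ratio of consecutive matches), e.g., longest digit run / length.
        runs = [len(m.group()) for m in re.finditer(pattern, chars.lower())]
        return max(runs, default=0) / len(chars) if chars else 0.0

    return {
        "nm_len": len(pkg),
        "word_len": len(words),
        "word_list_len_mean": sum(lengths) / len(lengths),
        "num_ratio": digits / len(chars),
        "vowel_ratio": vowels / len(chars),
        "consecutive_num_ratio": max_run_ratio(r"[0-9]+"),
        "consecutive_vowel_ratio": max_run_ratio(r"[aeiou]+"),
        "word_1_len": lengths[0],
    }

print(package_name_stats("com.a1b2.fakebankapp"))
```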
Table 7. Number of benign and malware apps by dataset.

| Model Set (Daily) | Train Set: Benign | Train Set: Malware | Test Set: Benign | Test Set: Malware |
|---|---|---|---|---|
| Dataset 1 | 66,027,370 | 6397 | 841,131 | 18 |
| Dataset 2 | 66,238,910 | 6415 | 607,316 | 24 |
| Dataset 3 | 66,287,400 | 6439 | 1,180,391 | 81 |
| … | … | … | … | … |
| Dataset 90 | 71,891,560 | 11,975 | 689,702 | 7 |
| Dataset 91 | 71,499,660 | 11,982 | 534,351 | 4 |
| Dataset 92 | 70,938,520 | 11,986 | 605,394 | 2 |
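A minimal sketch of the undersampling step used to build each daily training set is shown below. The exact benign-to-malware sampling ratio used by the authors is not stated, so `benign_per_malware` is an illustrative parameter, and the toy frame stands in for one day's FDS log.

```python
import pandas as pd

def build_train_set(df: pd.DataFrame, benign_per_malware: int = 50,
                    seed: int = 42) -> pd.DataFrame:
    """Undersample benign rows to reduce the class imbalance (assumed policy)."""
    malware = df[df["lbl_cd"] == "Malware"]
    benign = df[df["lbl_cd"] == "Benign"]
    n_benign = min(len(benign), benign_per_malware * len(malware))
    sampled = benign.sample(n=n_benign, random_state=seed)
    # Shuffle the combined set so the classes are interleaved.
    return pd.concat([malware, sampled]).sample(frac=1, random_state=seed)

# Toy stand-in for one day's labeled transactions.
daily = pd.DataFrame({"lbl_cd": ["Benign"] * 1000 + ["Malware"] * 5})
print(build_train_set(daily)["lbl_cd"].value_counts())
```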
Table 8. Calculated optimal hyperparameter values for different algorithms.

| Algorithm | Optimal Hyperparameter Values |
|---|---|
| Logistic Regression | {"max_iter": 100, "class_weight": "balanced", "fit_intercept": True, "C" (regularization strength): 1.0, "penalty": "l2"} |
| Random Forest | {"n_estimators": 200, "class_weight": "balanced", "max_depth": 64, "bootstrap": False} |
| LightGBM | {"n_estimators": 300, "learning_rate": 0.1, "max_depth": -1, "class_weight": "balanced", "max_delta_step": 100, "num_leaves": 128, "colsample_bytree": 0.5, "subsample": 1, "objective": "binary", "boost_from_average": False} |
| XGBoost | {"n_estimators": 200, "learning_rate": 0.1, "scale_pos_weight": num_neg_samples/num_pos_samples, "max_depth": 0} |
| CatBoost | {"iterations": 200, "learning_rate": 0.03, "subsample": 0.8, "scale_pos_weight": num_benign_samples/num_malware_samples, "depth": 16} |
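The sketch below shows what the grid-search step could look like for LightGBM with scikit-learn's GridSearchCV, using a small grid around the Table 8 values. The authors' full search space is not reported, and the synthetic imbalanced data merely makes the snippet runnable.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic, highly imbalanced stand-in data (~2% positives).
X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)

# A small illustrative grid around the optimal LightGBM values in Table 8.
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1],
    "num_leaves": [64, 128],
}
base = LGBMClassifier(objective="binary", class_weight="balanced",
                      colsample_bytree=0.5, subsample=1.0, max_depth=-1)
search = GridSearchCV(base, param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```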
Table 9. Results of evaluation metrics using the Logistic Regression classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.856419 | 0.000149 | 1.000000 | 0.000298 |
| Dataset 2 | 0.860246 | 0.000271 | 0.958333 | 0.000542 |
| Dataset 3 | 0.853666 | 0.000411 | 0.876543 | 0.000821 |
| … | … | … | … | … |
| Dataset 91 | 0.860366 | 0.000073 | 1.000000 | 0.000145 |
| Dataset 92 | 0.878249 | 0.000061 | 1.000000 | 0.000123 |
| Total_Avg | 0.849444 | 0.000376 | 0.924700 | 0.000751 |
Table 10. Results of evaluation metrics using the Random Forest classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999999 | 1.000000 | 0.944444 | 0.971429 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 0.999997 | 1.000000 | 0.950617 | 0.974684 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999994 | 0.987150 | 0.906815 | 0.945278 |
Table 11. Results of evaluation metrics using the LightGBM classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999998 | 0.900000 | 1.000000 | 0.947368 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 1.000000 | 1.000000 | 0.978261 | 0.989011 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999996 | 0.964526 | 0.972635 | 0.968564 |
Table 12. Results of evaluation metrics for the XGBoost classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999998 | 0.900000 | 1.000000 | 0.947368 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 0.999996 | 0.941860 | 1.000000 | 0.970060 |
| … | … | … | … | … |
| Dataset 91 | 0.999998 | 0.800000 | 1.000000 | 0.888889 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999992 | 0.919447 | 0.951350 | 0.935127 |
Table 13. Results of evaluation metrics for the CatBoost classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999995 | 0.818182 | 1.000000 | 0.900000 |
| Dataset 2 | 0.999993 | 0.857143 | 1.000000 | 0.923077 |
| Dataset 3 | 0.999994 | 0.920455 | 1.000000 | 0.958580 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999995 | 0.929774 | 0.987480 | 0.957759 |
Table 14. Overall comparison of the tested classifiers with two target classes.

| Model | Accuracy | Precision | Recall | F1-Score | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.849444 | 0.000376 | 0.924700 | 0.000751 | 0.953914 | 0.014239 |
| Random Forest | 0.999994 | 0.987150 | 0.906815 | 0.945278 | 0.999461 | 0.982923 |
| LightGBM | 0.999996 | 0.964526 | 0.972635 | 0.968564 | 0.999559 | 0.987537 |
| XGBoost | 0.999992 | 0.919447 | 0.951350 | 0.935127 | 0.999145 | 0.963108 |
| CatBoost | 0.999995 | 0.929774 | 0.987480 | 0.957759 | 0.999988 | 0.977388 |
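All six columns of Table 14 can be computed with standard scikit-learn metrics; the sketch below does so on toy labels and scores, using average precision as the usual estimate of AUPRC.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

# Toy labels (1 = malware) and model probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.10, 0.48, 0.55, 0.70, 0.95]
y_pred = [int(p >= 0.5) for p in y_prob]   # default 0.5 decision threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
# AUROC and AUPRC are threshold-free and use the raw probabilities.
print("AUROC    :", roc_auc_score(y_true, y_prob))
print("AUPRC    :", average_precision_score(y_true, y_prob))
```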
Table 15. Selection of decision threshold.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LightGBM + Threshold 1 | 0.999996 | 0.964526 | 0.972635 | 0.968564 |
| LightGBM + Threshold 2 | 0.999958 | 0.966507 | 0.974277 | 0.970376 |
| LightGBM + Threshold 3 | 0.999959 | 0.969151 | 0.972267 | 0.970706 |
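Table 15 identifies the candidate decision thresholds only by their labels, so the sketch below illustrates the general procedure: sweep a few threshold values over the predicted malware probabilities and recompute the threshold-dependent metrics. The threshold values and data are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])                    # toy labels
y_prob = np.array([0.02, 0.10, 0.48, 0.55, 0.70, 0.95])  # toy probabilities

for t in (0.3, 0.5, 0.7):  # candidate decision thresholds (illustrative)
    y_pred = (y_prob >= t).astype(int)
    print(f"threshold={t}: "
          f"precision={precision_score(y_true, y_pred):.3f} "
          f"recall={recall_score(y_true, y_pred):.3f} "
          f"F1={f1_score(y_true, y_pred):.3f}")
```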
Table 16. Variations of feature sets for feature selection techniques.

| Variation | Feature Sets |
|---|---|
| Variation1 | App information, App permissions, App service |
| Variation2 | App information, App permissions, App service, User activity |
| Variation3 | App information, App permissions, App service, User information |
| Variation4 | App information, App permissions, App service, App package name statistics |
| Variation5 | App information, App permissions, App service, User activity, App package name statistics |
| Our Proposed | App information, App permissions, App service, User activity, User information, App package name statistics |
Table 17. Statistical measures of Accuracy, Recall, Precision, and F1-Score: variation feature sets.

Accuracy

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.999551 | 0.999976 | 0.999813 | 0.999817 | 0.999763 | 0.999875 |
| Variation2 | 0.999646 | 0.999978 | 0.999901 | 0.999914 | 0.999875 | 0.999938 |
| Variation3 | 0.999526 | 1.000000 | 0.999848 | 0.999869 | 0.999795 | 0.999905 |
| Variation4 | 0.999615 | 1.000000 | 0.999873 | 0.999883 | 0.999841 | 0.999927 |
| Variation5 | 0.999773 | 1.000000 | 0.999934 | 0.999948 | 0.999902 | 0.999975 |
| Our Proposed | 0.999819 | 1.000000 | 0.999961 | 0.999972 | 0.999930 | 1.000000 |

Recall

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.500000 | 1.000000 | 0.829375 | 0.847826 | 0.782269 | 0.888889 |
| Variation2 | 0.500000 | 1.000000 | 0.879349 | 0.900980 | 0.833333 | 0.953611 |
| Variation3 | 0.500000 | 1.000000 | 0.838819 | 0.848913 | 0.780382 | 0.905844 |
| Variation4 | 0.500000 | 1.000000 | 0.846844 | 0.857143 | 0.797368 | 0.911275 |
| Variation5 | 0.714286 | 1.000000 | 0.966149 | 0.973684 | 0.945571 | 1.000000 |
| Our Proposed | 0.714286 | 1.000000 | 0.976437 | 1.000000 | 0.970129 | 1.000000 |

Precision

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.333333 | 0.975000 | 0.819563 | 0.851648 | 0.750000 | 0.921559 |
| Variation2 | 0.500000 | 1.000000 | 0.916453 | 0.950000 | 0.897222 | 0.968750 |
| Variation3 | 0.400000 | 1.000000 | 0.882924 | 0.907670 | 0.857143 | 0.963294 |
| Variation4 | 0.500000 | 1.000000 | 0.910078 | 0.948684 | 0.888889 | 0.972410 |
| Variation5 | 0.625000 | 1.000000 | 0.932044 | 0.942017 | 0.901829 | 0.974519 |
| Our Proposed | 0.714286 | 1.000000 | 0.961714 | 0.974359 | 0.944444 | 1.000000 |

F1-Score

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.400000 | 0.986667 | 0.821788 | 0.852459 | 0.747685 | 0.895522 |
| Variation2 | 0.500000 | 0.987013 | 0.895133 | 0.923077 | 0.857143 | 0.953416 |
| Variation3 | 0.444444 | 1.000000 | 0.857345 | 0.878356 | 0.819141 | 0.918254 |
| Variation4 | 0.500000 | 1.000000 | 0.875704 | 0.895522 | 0.841667 | 0.938776 |
| Variation5 | 0.714286 | 1.000000 | 0.947998 | 0.957746 | 0.932323 | 0.982456 |
| Our Proposed | 0.714286 | 1.000000 | 0.968545 | 0.978251 | 0.956522 | 1.000000 |
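The Table 17 summary statistics are simple aggregations over the 92 per-dataset scores; a pandas sketch of the computation on placeholder F1-scores is given below.

```python
import pandas as pd

# Placeholder per-dataset F1-scores; the study aggregates 92 daily values.
f1_scores = pd.Series([0.95, 0.97, 1.00, 0.98, 0.96, 0.99, 1.00, 0.97])

summary = f1_scores.agg(["min", "max", "mean", "median"])
summary["25%"] = f1_scores.quantile(0.25)
summary["75%"] = f1_scores.quantile(0.75)
print(summary)
```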
Table 18. Comparative analysis with existing work.

| Category | Proposed Work | Existing Work | Comparison |
|---|---|---|---|
| Datasets | Real-world dataset from Bank A (South Korea) with 183,938,730 benign transactions and 11,986 malware transactions; highly imbalanced | Most existing works (e.g., DroidDetector [54]) used smaller, lab-controlled datasets, often from third-party app stores or Google Play | The real-world, large-scale dataset offers better generalizability compared with the smaller, simulated datasets used in existing work. |
| Detection Methodology | Static analysis (feature extraction) for both known and unknown malware | Existing methods such as DroidDetector [54] and Zhou et al. [59] rely primarily on dynamic analysis, with high computational overheads (e.g., TaintDroid [12]) | Static analysis in the proposed work is more scalable and resource-efficient for large operational environments. |
| Machine Learning Models | Tested models: Logistic Regression, Random Forest, LightGBM, XGBoost, and CatBoost; LightGBM showed the best performance | Random Forest and SVM are commonly used but struggle with large, imbalanced datasets; some works use deep learning, which is computationally expensive | LightGBM is faster, more scalable, and better at handling imbalanced data than traditional models such as SVM and Random Forest. |
| Performance Metrics | Achieved 99.99% accuracy, an F1-score of 96.86% at the default threshold (97.04% after threshold selection), AUROC of 0.999559, and AUPRC of 0.987537 | DroidDetector [54] achieved 96.76% accuracy and a 92.85% F1-score; other models struggled with imbalanced datasets | The proposed model shows significant improvements in F1-score and AUPRC, effectively handling imbalanced data with fewer false positives. |
| Real-World Application | Designed for real-time deployment in financial institutions; uses a lightweight model (LightGBM) to handle large-scale data | Existing models such as TaintDroid [12] and DroidDetector [54] are not scalable or practical for real-time, large-scale environments | The proposed framework is more scalable and applicable to real-world financial institutions than existing methods. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
