Review

Advanced Financial Fraud Malware Detection Method in the Android Environment

1 School of Cybersecurity, Korea University, Seoul 02841, Republic of Korea
2 Ministry of Information Security, Hana Bank, Seoul 04523, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3905; https://doi.org/10.3390/app15073905
Submission received: 12 February 2025 / Revised: 25 March 2025 / Accepted: 27 March 2025 / Published: 2 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The open-source structure and ease of development of the Android platform are exploited by attackers to create malicious programs, leading to a sharp increase in malicious Android apps aimed at committing financial fraud. This study proposes a machine learning (ML) model based on static analysis to detect such malware. We validated the approach on private datasets collected by Bank A, comprising 183,938,730 benign and 11,986 malicious app samples. Because the benign and malicious app data were imbalanced, undersampling was performed to adjust the proportion of benign applications in the training data. Moreover, 92 datasets were compiled through daily training to evaluate the proposed approach, with benign app data updated over a rolling 70-day window (D-70 to D-1) and malware app data cumulatively aggregated to address the imbalance. Five ML algorithms were evaluated, and the optimal hyperparameter values for each algorithm were obtained using a grid search. We then evaluated the models using common metrics, including accuracy, precision, recall, and F1-Score. The LightGBM model was selected for its superior performance. The optimal decision threshold for determining whether an application was malicious was 0.5. Following re-evaluation, the LightGBM model obtained accuracy and F1-Score values of 99.99% and 97.04%, respectively, highlighting the potential of the proposed model for real-world financial fraud detection.

1. Introduction

Google Play Store offers over 2.85 million applications (apps) [1]. The number of available apps surpassed one million in July 2013 and reached 2.43 million in the fourth quarter of 2023 [2]. These apps provide users with services, such as online shopping, gaming, finance, health, social networking, location tracking, and navigation. According to a report by the International Data Corporation, 326.1 million mobile phones were shipped in the fourth quarter of 2023 [3], and phones based on open-source Android comprise 70.1% of the smartphone market [4].
The openness and extensive adoption of Android make it a crucial target for malicious attackers [5]. A tremendous amount of Android malware has been created and spread and has been used in various illegal activities, including stealing personal information from devices, damaging systems, and phishing. Moreover, unlike the Play Store, untrustworthy app stores allow users to indiscriminately download apps without safety mechanisms, readily exposing them to risk from attackers. The urgency and importance of malware detection have increased, particularly as mobile financial services such as mobile banking and electronic payments have become popular and widespread [6].
New financial fraud malware is often crafted to bypass antiviral software and stealthily infiltrate smartphones. These attacks predominantly focus on Android users, who enjoy considerable freedom in development and application usage. Users who lack awareness of common phishing scams and malware prevention solutions on their phones are the most vulnerable to malware attacks [7]. Financial fraud losses in South Korea, totaling 145.1 billion KRW in 2022, increased by 35.4% to 196.5 billion KRW in 2023, with countless users remaining at risk of financial fraud [8]. Allix et al. [9] identified 22% of the apps on the Google Play Store and 50% of those on App China as malware apps.
Mobile malware is constantly evolving with new features designed to evade detection by malware scanners. Android malware applications typically employ three principal techniques to compromise user devices: repackaging, updating, and downloading [10].
Various malware detection models have been proposed to protect consumers’ personal information and ensure economic security. These studies can be broadly divided into three types: static [11], dynamic [12], and hybrid [13] analyses. Static analysis enables malware detection without executing an application, so mobile devices are never exposed to the malicious code during analysis [14], making it a secure approach to malware detection. Conversely, dynamic analysis requires executing an application within an isolated environment; owing to the limitations of this operational environment, suspicious behaviors can only be detected at runtime. Moreover, as the number of Android apps has increased exponentially, the efficiency of dynamic analysis for identifying malware has decreased. Finally, hybrid analysis combines static and dynamic methods. We propose a static-analysis-based method to identify financial fraud malware. Specifically, static analysis enables the rapid scanning of large numbers of applications by examining code or APK files without executing malicious code. It can also identify malicious behavior patterns through an in-depth analysis of the APK structure, such as permission requests and application programming interface (API) calls, and detect malware from specific code patterns or function calls. Additionally, it offers the benefit of a pre-execution review, ensuring that malware can be detected before it is activated. Finally, static analysis conserves resources because it does not require a real-world execution environment, making it ideal for efficiently scanning many apps. Our study focuses on malware types that are highly associated with financial fraud: billing fraud, stalkerware, hostile downloaders, phishing, and spyware (Figure 1) [15]; a minimal extraction sketch follows the list below.
  • Billing fraud: Malicious apps that exploit payment systems without user consent, often through Android payment methods or unauthorized transactions [16].
  • Stalkerware: Software installed on a user device without knowledge, enabling third parties to track location, monitor communication, and access personal data [17].
  • Hostile downloaders: Code that does not directly cause harm but downloads other unwanted or malicious software onto the device [18].
  • Phishing: These apps disguise themselves as trusted sources, tricking users into submitting personal and billing information, which is then sent to malicious third parties [19].
  • Spyware: Malicious software that monitors the online activity of a user and collects sensitive information, such as login credentials, banking details, and private data without the consent of the user [20].
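To make the static-analysis workflow concrete, the sketch below shows how the indicators discussed above (package name, permissions, services) can be read from an APK without running it. It is a minimal illustration, not the paper's pipeline: it assumes the third-party androguard library (whose import path varies across versions) and a placeholder file name.

```python
# Minimal sketch: reading static features from an APK without executing it.
# Assumes the third-party `androguard` package (import path varies by version)
# and a placeholder APK file name; this is not the paper's actual pipeline.
from androguard.core.bytecodes.apk import APK

def extract_static_features(apk_path: str) -> dict:
    apk = APK(apk_path)
    return {
        "package_name": apk.get_package(),     # e.g., "com.example.app"
        "permissions": apk.get_permissions(),  # permissions from manifest.xml
        "services": apk.get_services(),        # background service components
    }

feats = extract_static_features("sample.apk")
print("READ_SMS requested:",
      "android.permission.READ_SMS" in feats["permissions"])
```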
In this paper, we present a machine learning (ML) model that leverages static analysis to detect financial fraud malware. ML is well suited for this task because it excels at identifying intricate patterns and generalizing from past data, making it ideal for recognizing evolving malware strategies. Static analysis, which involves examining the code and structure of an app without execution, provides an efficient and safe method to extract features such as permissions, API calls, and code patterns. The proposed ML model uses static analysis to rapidly analyze large datasets of apps and detect financial fraud malware before execution, thereby minimizing security risks. This approach combines scalability with the ability to detect new and previously undiscovered malware strains. The main contributions of this study are as follows:
  • Static Analysis for Financial Fraud Malware Detection: This study proposes a novel approach utilizing static analysis for detecting financial fraud malware. Compared to dynamic analysis, static analysis is more efficient for real-time detection in operational environments.
  • Real-World Financial Data Utilization and Feature Engineering: The research leverages real-world financial fraud malware data from Bank A in South Korea. Seventy-one features were extracted from eight feature sets, and the accuracy of distinguishing between benign and malicious apps was enhanced by incorporating unique features such as “User activity”, “User information”, and “App package name statistics”, which differentiate this study from previous research.
  • Extensive ML Model Comparison and Optimization: Multiple machine learning models, including logistic regression, random forest, LightGBM, XGBoost, and CatBoost, were evaluated. Optimized hyperparameters were determined using a grid search to enhance model performance.
  • Scalable and High-Performance Malware Detection for Financial Institutions: The LightGBM model achieved an accuracy of 99.99% and an F1-score of 97.04%, demonstrating its effectiveness as a scalable, real-time cybersecurity solution for financial institutions.
The remainder of this paper is organized as follows: Section 2 provides an overview of related work, Section 3 describes the proposed methodology, Section 4 covers the experiments and results, Section 5 discusses limitations and future directions, and Section 6 concludes the study.

2. Related Work

Various ML techniques have been developed to detect or prevent Android malware. These methods apply static and dynamic analyses, hybrid analysis, and feature extraction.

2.1. Static Analysis-Based Malware Detection

Numerous researchers have explored using static analysis and feature-based methods for detecting Android malware. Table 1 presents a comparison of these studies.
In [21], the authors mined hidden patterns in Android applications using highly sensitive permission-related APIs. An automated malware detection system, MalPat, was introduced and tested on a dataset containing 3185 benign apps and 15,336 malware samples obtained from the Google Play Store [22], VirusShare [23], and Contagio [24]. MalPat achieved a remarkable F1-score of 98.24%.
In [25], an ML approach utilizing logistic regression on permissions and API features was used to detect Android malware. The model was evaluated using two datasets obtained from RPAGaurd [26] and Google Play: a full-feature dataset and a reduced-feature dataset containing 131 features. The model achieved accuracies of 97.25% and 95.87% with full- and reduced-feature datasets, respectively.
In [27], the authors proposed DeepClassifyDroid, a deep learning-based system for Android malware detection. The system employs a three-step process: feature extraction, feature embedding, and detection using a convolutional neural network (CNN) on datasets obtained from Drebin [28] and Chinese app markets. Initially, it performs a comprehensive static analysis and generates five distinct feature sets. Finally, DeepClassifyDroid uses a CNN to detect malware. It outperformed most conventional ML methods, achieving a 97.4% detection rate without false alarms. Furthermore, it was 10 times faster than linear support vector machine (SVM) models and 80 times faster than k-nearest neighbor (KNN) models. In [29], the performance of the KNN, a simple yet effective ML classifier, was evaluated by examining various distance measures and hyperparameters. Specifically, the study involved extensive experiments on the Drebin [28] dataset and comparisons of multiple well-known distance metrics. The findings revealed that selecting appropriate distance measures significantly affected classification accuracy. The authors argued that the commonly used Euclidean distance is not optimal for mobile malware detection, and alternative metrics, such as Hamming and CityBlock distances, can improve the classification performance.
In [30], a malware prevention system using an “end-to-end deep-learning architecture” was proposed. The system detects Android malware and assigns attributes based on the opcodes extracted from the malware. The authors of the study demonstrated that a bidirectional long short-term memory (BiLSTM) neural network outperformed state-of-the-art models in detecting the static behavior of Android malware without using the features developed by other models. Their model demonstrated exceptional performance, achieving an accuracy of 99.90% and an F1-score of 99.60% on a large dataset of more than 1.8 million Android apps obtained from the Android malware dataset (AMD) [31], Drebin [28], and VirusShare [23].
In [32], DANdroid, an Android malware detection model, leveraged a discriminative adversarial network (DAN). The contributions of the study were threefold: (1) the proposed method can effectively differentiate adversarial learning outcomes from malware feature representations; (2) it employs a Multi-view deep learning architecture with three feature sets (raw opcodes, permissions, and API calls) to enhance resistance to obfuscation; and (3) the approach demonstrates the ability to generalize to future obfuscation techniques that were not present during model training. The model attained an average F1-score of 97.3% when evaluated using the Drebin [28] dataset.
In [33], the authors proposed a method that extracts the most significant features of Android apps using static analysis supplemented by two newly introduced features. These features were subsequently processed using a functional API-based deep-learning model. The method was evaluated using a newly curated dataset of 14,079 malware and benign Android applications obtained from VirusTotal [34], AMD [31], MalDozer, and the Contagio Security Blog. Malware samples were organized into four categories. Two experiments were conducted on this dataset. The first focused on binary classification, separating the samples into malware and benign classes. The second performed malware family classification of the samples into five classes. The proposed method achieved F1-scores of 99.5% and 97% in the two- and five-class experiments, respectively.
In [35], an advanced and reliable malware detection system that utilizes deep learning algorithms was developed. The proposed system evaluated recurrent neural network (RNN)-based methods, including LSTM, BiLSTM, and gated recurrent unit (GRU), on the CICInVesSandMal2019 [36] dataset, which comprised 8115 static features for malware detection. The BiLSTM model outperformed the other RNN-based approaches, achieving an accuracy of 98.85% and an F1-score of 98.21%.
In [37], a permission-based malware detection system called PerDRaML was proposed. It determined whether an app was malicious based on suspicious permission usage. The system utilized a multistep methodology to extract and identify important features, including permissions, app size, and permission ratios, from a manually collected dataset of 10,000 apps. Moreover, the authors of the study used various ML models to classify apps as malicious or normal and successfully identified the five most important features to predict malicious apps through extensive experimentation. They achieved high malware detection accuracies of 89.7%, 89.96%, 86.25%, and 89.52% using SVM, random forest, rotation forest, and Naive Bayes approaches, respectively, thereby outperforming conventional techniques.

2.2. Dynamic Analysis-Based Malware Detection

This section reviews several notable studies that employ dynamic analysis for Android malware detection. Table 2 presents a comparison of these studies.
In [38], a model was developed for the dynamic analysis of Android applications to detect malware by monitoring the Android API and system calls. The model was evaluated using datasets from the MalGenome Project [39] and VirusShare [23], comprising 7520 apps. Using various classification algorithms, the model achieved an accuracy of 96.66%. However, it exhibited limitations in detecting certain types of malicious behavior. For instance, it does not gather information on the behavior of apps that operate offline and halts execution in such cases.
In [40], a method that relies solely on application runtime behavior was proposed to classify Android malware into families using SVM and conformal prediction techniques. The approach was evaluated on the Drebin dataset [28] using system calls and binder communication features. It achieved 94% accuracy on the test apps; however, because it tracked only low-level events, it collected limited information and achieved insufficient application coverage. DroidCat, a dynamic classification model for Android apps, was proposed in [41]. The model utilizes various dynamic features, such as intents, method calls, and intercomponent communication. DroidCat was tested on a comprehensive dataset of 34,343 apps collected from MalGenome [39], AndroZoo [42], VirusShare [23], and Drebin [28], achieving an accuracy of approximately 97%. The authors highlighted that some dynamic features, such as the distribution of method calls within the source code and the app execution structure that captures library characteristics, are more influential in the classification process than features such as sensitive flows.
In [43], system calls were extracted from the CIC-ANDMAL2017 dataset [44], and experiments were conducted using different ML algorithms. The KNN and decision tree algorithms reportedly achieved F1-scores of 85% and 72% for malware detection and family classification, respectively. In [45], a parallel ML model was developed that incorporated various classifiers, such as J48, KNN, SVM, and random forest, to detect and classify Android malware using dynamic features. Standard ML classifiers were implemented to detect Android malware categories and families by analyzing a large malware dataset containing 14 major malware categories and 180 well-known malware families from CCCS-CIC-AndMal (2020) [46] in the dynamic layer. The authors conducted experiments using several ML algorithms and compared their proposed model with that of the most recent relevant study. Their model achieved accuracies of 96.89% and 99.65% in detecting Android malware categories and families, respectively.

2.3. Hybrid Analysis-Based Malware Detection

This section reviews several studies that have applied hybrid features, combining static and dynamic characteristics for Android malware detection. Table 3 presents a comparison of the studies discussed herein.
In [47], StormDroid, a streamlined ML-based malware detection model that adopts a hybrid approach, was proposed. It leveraged various features, including permissions, sensitive API calls, sequences, and dynamic behaviors, to classify Android malware. The model was evaluated using multiple ML algorithms, such as SVM, C4.5, multi-layer perceptron (MLP), naïve Bayes (NB), instance-based k (IBK), and Bagging predictors, on a dataset of approximately 8000 apps collected from Google Play [22] and Contagio [24]. StormDroid enables large-scale analysis by monitoring both static and dynamic behaviors and demonstrated an accuracy of 93.8%. Additionally, it achieved approximately three times the efficiency of single-threaded models.
MADAM, a cross-layer model that incorporates static and dynamic features such as system calls, short message service (SMS) messages, critical APIs, user activities, and app metadata, was introduced in [48]. The model was evaluated using various algorithms, including KNN, linear discriminant classification (LDC), quadratic discriminant classifier (QDC), MLP, the Parzen classifier (PARZC), and radial basis function (RBF), on large datasets sourced from MalGenome [39], Contagio [24], and VirusShare [23], achieving an accuracy of 96.6%. Additionally, MADAM is designed to minimize battery consumption and imposes low performance overhead.
In [49], Androtomist, an open-source tool for the static and dynamic analysis of Android applications, was introduced. The tool offers two operational modes: a novice mode for beginners and an expert mode for advanced users. Androtomist was tested on three datasets using various ML classifiers, with an ensemble approach applied to average the results and determine the most significant features. It achieved a perfect accuracy of 100% on the Drebin [28] and VirusShare [23] datasets, whereas its accuracy was slightly lower at 91.8% on the AndroZoo [42] dataset.
Another study [50] proposed a deep learning model for detecting Android malware using a hybrid analysis approach. The model utilized the MalGenome [39] and Drebin [28] datasets for static analysis and the CICMalDroid2020 [51] dataset for dynamic analysis. In total, 261 combined features were extracted for the hybrid analysis. The model was further evaluated using 311 application samples comprising 165 benign apps from the Play Store and 146 malicious apps from VirusShare [23]. The proposed model achieved an accuracy of 99.36%. The authors highlighted that a hybrid analysis approach enhanced the detection rate by approximately 5%.
In [52], a hybrid malware detection method was proposed using tree-augmented Naïve Bayes (TAN) to analyze the conditional dependencies between key static and dynamic features, including API calls, permissions, and system calls required for application functionality. The approach involved training three ridge-regularized logistic regression classifiers to analyze API calls, permissions, and system calls, and employing TAN to model the relationships between these outputs to identify malicious applications. The proposed method achieved sustained detection accuracy of 97% over an extended period.

3. Proposed Malware Detection Approach

The malware detection framework presented in Figure 2 comprises several critical steps. Preprocessing tasks, such as feature extraction, managing missing values, and calculating the undersampling ratio, were performed to mitigate the imbalance between benign and malicious applications [54,55]. Following the daily configuration of the training dataset, the optimal hyperparameters for each ML algorithm were determined using established methods. The algorithms were then evaluated using metrics such as accuracy, precision, recall, and F1-Score, and by plotting precision–recall and receiver operating characteristic (ROC) curves and computing the areas under them (AUPRC and AUROC) [56]. Based on these evaluations, the optimal decision threshold was selected for the best-performing algorithm [57], and the final malware detection model was applied. Furthermore, malware detection includes classification into five categories: Billing Fraud, Stalkerware, Hostile Downloaders, Phishing, and Spyware. However, in this study, experiments were conducted using only two classes: Benign and Malware. The steps are detailed below.

3.1. Data Collection

In this study, we used real-time data collected by Bank A in South Korea from April 2023 to September 2023 (Figure 3). The fraud detection system (FDS) gathered information (Appinfo) on six apps from the Android smartphones of customers who had installed Bank A’s mobile banking app. Specifically, the system collected data on three apps from verified sources (such as the Google Play Store) and three apps from unverified sources (such as third-party stores or APK files) [58]. These apps had been installed within the five days before the customer’s mobile banking login. This distinction between trusted and untrusted sources is critical because apps from unverified sources are more vulnerable to malware [59,60,61]. The collected data were subsequently analyzed to detect malicious apps by applying a blacklist or feature pattern analysis method.
The actual data contained information on 183,938,730 and 11,986 transactions for the benign and malware apps, respectively. For malware app information, an FDS staff member of Bank A spoke with customers over the phone and checked the information on the apps installed on their smartphones to determine whether a malicious app was installed. Table 4 and Table 5 present the details of the raw data collected.

3.2. Feature Extraction

Upon obtaining the data, the features for model training were extracted. We analyzed the raw data and added two categories to the six feature sets listed in Table 6: user information and app package name statistics. The selection of “user information” and “app package name statistics” as additional feature sets was based on their relevance in enhancing the detection of financial fraud malware. “User information”, which includes demographic details such as age group and sex, can provide valuable insights into risk patterns associated with malware infections. Certain demographics may be more vulnerable to specific types of malware or targeted attacks. Additionally, “app package name statistics” help identify suspicious or uncommon app names, often used by malicious apps to masquerade as legitimate ones. These features complement traditional static analysis techniques and improve the accuracy of the model in distinguishing between benign and malicious apps. Consequently, eight feature sets were obtained. Moreover, 57 variables were added to the 14 listed in Table 4 to extract 71 features. Table 5 and Table 6 list the final selected feature sets.
  • Target: The “label” in the raw data to determine if an app is malicious. This was used as the dependent variable to train the model.
  • App information: “App package name” and “app size” were selected from the raw data. Some information in “app size” was missing; therefore, a derived variable called “app size Null indicator” was added. “App name” was excluded from this study because it requires further research on natural language processing as it contains data in various languages, including English, Korean, Chinese, and Japanese.
  • App permissions: These were extracted from the manifest.xml file. Permission information is a key element in determining whether an app is malicious, as evidenced by several related studies on static analysis of malware apps [10,25,27,29,33,35,37]. Of the numerous permissions available, this study utilized “read SMS permission”, “write SMS permission”, “process outgoing calls permission”, “call phone permission”, “request install packages permission”, and “manage external storage permission”.
  • App service: A service is a background component that runs independently of the user interface [62]. The services were also extracted from the manifest.xml file. In this study, only services that could determine the default phone app indicators were used.
  • User activity: Classification of elements changed by user activity. We utilized the “App source” and added “app source verification indicator”, “elapsed time after installation”, and “elapsed time after installation Null indicator” as derived variables. Most malicious applications that steal personal information are installed through illegal URLs rather than through publicly available app stores. Hence, “app source” information is crucial in determining whether an app is malicious.
  • User information: The age range and sex of users who install malicious applications are considered important factors. Hence, customer information from Bank A was combined with the raw data to add “age”, “age Null indicator”, and “sex” as additional variables.
  • App package name statistics: According to the analysis of the app package names in the raw data, 99% of package names contained six or fewer words. The following derived variables were created to learn package name patterns: the length of each of the first six words (package name words are separated by “.”); and, for the entire package name string, the number of words, the standard deviation of word length, the ratio of digits, the ratio of vowels, the ratio of consonants, and the ratios of the maximum consecutive digits, vowels, and consonants (see the sketch after this list).
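The package-name statistics above can be computed with simple string operations. The following Python sketch is illustrative: the feature names and exact definitions are an interpretation of the list above, not the paper's released implementation.

```python
# Sketch: deriving the "app package name statistics" features described above.
# Feature names and exact definitions are illustrative interpretations.
import re
import statistics

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
DIGITS = set("0123456789")

def max_run(s: str, charset) -> int:
    """Length of the longest consecutive run of characters from charset."""
    best = run = 0
    for ch in s:
        run = run + 1 if ch in charset else 0
        best = max(best, run)
    return best

def package_name_features(pkg: str) -> dict:
    words = pkg.lower().split(".")                      # words separated by "."
    chars = re.sub(r"[^a-z0-9]", "", pkg.lower())       # letters and digits only
    n = max(len(chars), 1)                              # avoid division by zero
    feats = {f"word_{i + 1}_len": (len(words[i]) if i < len(words) else 0)
             for i in range(6)}                         # first six word lengths
    feats.update({
        "num_words": len(words),
        "word_len_std": statistics.pstdev(len(w) for w in words),
        "digit_ratio": sum(c in DIGITS for c in chars) / n,
        "vowel_ratio": sum(c in VOWELS for c in chars) / n,
        "consonant_ratio": sum(c in CONSONANTS for c in chars) / n,
        "max_run_digit_ratio": max_run(chars, DIGITS) / n,
        "max_run_vowel_ratio": max_run(chars, VOWELS) / n,
        "max_run_consonant_ratio": max_run(chars, CONSONANTS) / n,
    })
    return feats

print(package_name_features("com.example.banking2fa"))
```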

3.3. Data Preprocessing

In the raw data, the numerical variables “app size”, “elapsed time after installation”, and “age”, and the categorical variables “App permissions” and “App service” contained negative or missing values. Thus, the following preprocessing was performed to handle missing values (a pandas sketch follows the list).
  • App size (app_sz): If a value is missing, it is replaced with 0, and the derived variable “app size Null indicator (app_sz_yn)” is created. If “app_sz” is missing, “app_sz_yn” is set to 1, and if not, it is set to 0.
  • Elapsed time after installation (ist_af_psg_drtm): If a value is negative or missing, it is replaced with 0, and the derived variable “elapsed time after installation Null indicator (ist_af_psg_drtm_yn)” is created. If “ist_af_psg_drtm” is missing, “ist_af_psg_drtm_yn” is set to 1; otherwise, it is set to 0.
  • Age (age): If a value is negative or missing, it is replaced with −1, and the “age Null indicator (age_yn)” is created as an additional variable. If “age” is missing, “age_yn” is set to 1; otherwise, it is set to 0.
  • App permissions: If an app permission (read_sms to manage_external_storage) has a missing value, it is replaced with “?”. Each permission value is then encoded as 1 if it is “Y” or “?”, and 0 if it is “N”.
  • App service: If the app service (basc_phone_app) value is missing, it is replaced with “?”. The value is encoded as 1 if it is “Y” or “?”, and 0 if it is “N”.
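The rules above translate directly into a few pandas operations. The sketch below is a minimal rendering of those rules; the column names follow the paper's abbreviations (Tables 4 and 6), but the intermediate permission column names and the DataFrame itself are placeholders.

```python
# Sketch of the missing-value rules above, written with pandas. Column names
# follow the paper's abbreviations; intermediate permission column names are
# assumptions, and `df` is a placeholder frame.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # App size: missing -> 0, with a Null indicator.
    df["app_sz_yn"] = df["app_sz"].isna().astype(int)
    df["app_sz"] = df["app_sz"].fillna(0)
    # Elapsed time after installation: negative or missing -> 0, with indicator.
    df["ist_af_psg_drtm_yn"] = df["ist_af_psg_drtm"].isna().astype(int)
    df["ist_af_psg_drtm"] = df["ist_af_psg_drtm"].fillna(0).clip(lower=0)
    # Age: negative or missing -> -1, with indicator.
    df["age_yn"] = df["age"].isna().astype(int)
    age = df["age"].fillna(-1)
    df["age"] = age.where(age >= 0, -1)
    # App permissions: "Y" or missing ("?") -> 1, "N" -> 0.
    perm_cols = ["read_sms", "write_sms", "process_outgoing_calls",
                 "call_phone", "request_install_packages",
                 "manage_external_storage"]
    for col in perm_cols:
        df[col] = df[col].fillna("?").map({"Y": 1, "N": 0, "?": 1})
    # App service (default phone app): "Y" or missing ("?") -> 1, "N" -> 0.
    df["basc_phone_app"] = (df["basc_phone_app"].fillna("?")
                            .map({"Y": 1, "N": 0, "?": 1}))
    return df
```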

3.4. Dataset

Using the raw data collected over six months, 92 daily analysis datasets were constructed from 1 April 2023 to 30 September 2023 to evaluate each algorithm. For each analysis dataset, the benign-app training data covered a rolling 70-day window, labeled D-70 to D-1 (Figure 4). The malware-app training data were cumulatively aggregated from April to mitigate the class imbalance. For the test dataset, the evaluation was performed 92 times, once per day from 1 July 2023 to the evaluation date (Day D).
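The windowing scheme can be sketched as follows. This is an illustrative reconstruction of Figure 4's logic: the column names ("event_date", "label") and the tiny synthetic frame are placeholders, not the paper's actual data layout.

```python
# Sketch: assembling one of the 92 daily analysis datasets per Figure 4.
import pandas as pd

df = pd.DataFrame({  # placeholder data so the sketch runs end to end
    "event_date": pd.to_datetime(["2023-04-02", "2023-06-30", "2023-07-01"]),
    "label": [1, 0, 0],  # 1 = malware, 0 = benign
})

def build_daily_dataset(df, eval_date, history_start):
    d1 = eval_date - pd.Timedelta(days=1)
    d70 = eval_date - pd.Timedelta(days=70)
    # Benign training data: rolling 70-day window, D-70 to D-1.
    train_benign = df[(df["label"] == 0) & df["event_date"].between(d70, d1)]
    # Malware training data: cumulative from April to D-1 (imbalance fix).
    train_malware = df[(df["label"] == 1)
                       & df["event_date"].between(history_start, d1)]
    test = df[df["event_date"] == eval_date]  # evaluation day D
    return pd.concat([train_benign, train_malware]), test

# One dataset per evaluation day from 1 July to 30 September 2023 (92 days).
for day in pd.date_range("2023-07-01", "2023-09-30"):
    train, test = build_daily_dataset(df, day, pd.Timestamp("2023-04-01"))
```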
The ratio of benign to malicious app data in the “Train” dataset exceeded 10,000:1, causing a severe class imbalance. Such imbalance can cause performance issues in ML algorithms and distort evaluation results (Table 7). Therefore, several studies have used undersampling and oversampling methods to resolve data imbalances [63]. In this study, undersampling was employed because the proportion of benign app data was much larger. This method was chosen for its advantages in improving computational efficiency, enhancing the focus of the model on the minority class (malicious apps), and maintaining data authenticity. Unlike oversampling or SMOTE [64], undersampling minimizes the risks of overfitting and noise, making it a more effective approach given the characteristics and objectives of the dataset. Hence, we added logic to sample 10% of the “Train(Benign)” app data. The 10% sampling rate was determined experimentally: four test dates were sampled (11, 16, 24, and 28 August 2023), and the sampling rate of the “Train(Benign)” app data was varied between 2% and 30%. Based on the AUPRC, model performance stabilized when 10% or more of the benign app data were sampled.
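Continuing the previous sketch, the 10% undersampling of the benign class reduces to a stratified sample; the random seed is illustrative.

```python
# Sketch: keep all malware rows, sample 10% of benign rows (seed illustrative).
benign_sampled = train[train["label"] == 0].sample(frac=0.10, random_state=42)
malware_all = train[train["label"] == 1]
train_balanced = (pd.concat([benign_sampled, malware_all])
                  .sample(frac=1.0, random_state=42))  # shuffle rows
```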

3.5. Algorithms

In this study, we performed evaluations by selecting major ML algorithms commonly used to evaluate classification models. A grid search method was used to calculate the optimal hyperparameter values to improve the performance of each algorithm (Table 8).
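A grid search of this kind can be sketched with scikit-learn as below. The grids shown are illustrative, not the paper's actual search space (Table 8), and X_train/y_train are placeholders for the preprocessed features and labels.

```python
# Sketch: hyperparameter tuning with a grid search over illustrative grids.
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [4, 8, 12],
}
search = GridSearchCV(LGBMClassifier(class_weight="balanced"),
                      param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X_train, y_train)  # placeholders for preprocessed features/labels
print(search.best_params_, search.best_score_)
```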

3.5.1. Logistic Regression

Logistic regression is a predictive analysis technique used to determine the relationship between two or more variables. It examines whether a binary dependent variable is associated with one or more independent variables, which can be ordinal, nominal, interval, or ratio-level [65]. The adjusted hyperparameters include “max_iter”, “class_weight”, “fit_intercept”, “regularization strength”, and “penalty”.

3.5.2. Random Forest

Random forest is an ensemble learning algorithm that builds multiple decision trees and aggregates their outputs to produce more accurate predictions. When training each decision tree model, the random forest uses the bagging method to train individual decision tree models using a dataset sampled from the entire training dataset, allowing duplicates. The values predicted by these models were averaged to produce the final prediction. The bagging method improves the generalization performance of the prediction model [66]. The adjusted hyperparameters include “n_estimators”, “bootstrap”, “class_weight”, and “max_depth”.

3.5.3. LightGBM

The LightGBM algorithm is built on a Gradient Boosting Decision Tree (GBDT) framework designed to enhance computational efficiency, especially for large-scale data prediction tasks. This high-performance algorithm can quickly process and distribute vast amounts of data. LightGBM achieves faster training and lower memory usage by using a histogram-based approach and a leafwise growth strategy with a maximum depth constraint for trees [67]. The adjusted hyperparameters include “n_estimators”, “learning_rate”, “max_depth”, “class_weight” (set to “balanced”), “max_delta_step”, “num_leaves”, “colsample_bytree”, “subsample”, “objective” (set to “binary”), and “boost_from_average”.
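As a reference point, the sketch below wires up a LightGBM classifier with the hyperparameter names listed above; the specific values are illustrative defaults, not the tuned values from Table 8.

```python
# Sketch: a LightGBM classifier using the hyperparameter names tuned above;
# the values shown are illustrative, not the paper's tuned settings.
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    objective="binary",
    boost_from_average=True,
    class_weight="balanced",
    n_estimators=300,
    learning_rate=0.05,
    max_depth=8,          # leafwise growth with a maximum depth constraint
    num_leaves=63,
    colsample_bytree=0.8,
    subsample=0.8,
    max_delta_step=0.0,   # <= 0 leaves leaf output unconstrained
)
model.fit(X_train, y_train)  # placeholders for the preprocessed training data
```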

3.5.4. XGBoost

XGBoost is a powerful ML algorithm that builds decision trees using gradient boosting. XGBoost is favored over other gradient-boosting machines owing to its fast execution, superior model performance, and efficient use of memory resources. This approach incrementally adds models to correct the errors of previous models. It leverages parallel computation to utilize all available CPUs for tree construction during training. XGBoost enhances computational efficiency and speed by using the “maximum depth” parameter instead of traditional stopping criteria and employs backward tree pruning. Additionally, it incorporates regularization to mitigate overfitting and improve overall performance [68]. The adjusted hyperparameters include “n_estimators”, “learning_rate”, “scale_pos_weight”, and “max_depth”.

3.5.5. CatBoost

CatBoost is an advanced gradient-boosting algorithm introduced by Prokhorenkova et al. [69]. It is highly effective at handling imbalanced datasets and serves as a strong contender for classification algorithms. CatBoost is a specialized form of gradient boosting for decision trees capable of processing categorical and ordered features while mitigating overfitting through a Bayesian estimator. Unlike many other ML models, CatBoost requires minimal training effort and is versatile across various data types and formats. It supports both central processing unit (CPU) and graphics processing unit (GPU) implementations, with the GPU version enabling significantly faster training compared to other leading GBDT implementations, such as XGBoost and LightGBM, for similarly sized ensembles. CatBoost employs an efficient approach to reduce overfitting, thereby enabling the use of an entire dataset for training [70]. The adjusted hyperparameters include “iterations”, “learning_rate”, “subsample”, “scale_pos_weight”, and “depth”.

3.6. Construction of the Proposed Model

Figure 5 illustrates the construction of the proposed model and how the various types of features flow through it.

3.7. Evaluation Metrics

We evaluated the performance of the proposed model using standard metrics, including accuracy, precision, recall, and F1-score. These metrics were computed using the following formulae:
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-Score = (2 × Precision × Recall) / (Precision + Recall).
True positives (TP) represent the number of applications correctly identified as malicious. True negatives (TN) refer to the number of applications correctly identified as benign. False positives (FP) are benign applications incorrectly classified as malicious. False negatives (FN) are malicious applications incorrectly classified as benign.
Accuracy measures the overall performance of the classifier and represents the proportion of correct predictions made by the model. However, this is not a reliable metric for imbalanced datasets because it can produce high values even if the model correctly identifies only a single malware app. Recall evaluates the effectiveness of a classifier in detecting malware applications, whereas precision assesses the reliability of the predictions of a classifier. The F1-score, calculated as the harmonic mean of the recall and precision, provides a balanced evaluation by accounting for FNs and FPs.
The AUROC evaluates the ability of the model to distinguish between classes by measuring separability. It is derived from the ROC curve, a graphical plot of the TP rate against the FP rate at various thresholds.
The precision-recall curve is commonly used to assess classifiers based on their precision and recall. Generally, precision is plotted on the y-axis, and recall is plotted on the x-axis, creating a two-dimensional graph for comparison.
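All of these metrics are available in scikit-learn; the sketch below computes them from placeholder labels y_true and predicted malware probabilities y_prob (average_precision_score is used here as the standard estimate of the area under the precision-recall curve).

```python
# Sketch: computing the metrics above with scikit-learn; `y_true` and
# `y_prob` are placeholders for labels and predicted probabilities.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_pred = (y_prob >= 0.5).astype(int)  # default decision threshold of 0.5
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("AUROC    :", roc_auc_score(y_true, y_prob))
print("AUPRC    :", average_precision_score(y_true, y_prob))
```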

4. Experimental Results and Discussion

This section details the testing and evaluation of various ML algorithms generally used in previous studies on the 92 datasets discussed earlier. The best-performing algorithm was selected based on model evaluation metrics. Subsequently, an optimal decision threshold was chosen. Thereafter, a comparative evaluation was conducted by varying the training cycle and feature values to verify the superiority of the final model. Finally, the importance of each feature used to train the model was examined.

4.1. Model Evaluation

This section evaluates the logistic regression, random forest, LightGBM, XGBoost, and CatBoost models, which are ML classifiers, on the 92 datasets discussed earlier.

4.1.1. Evaluation of Logistic Regression

Table 9 presents the results of the logistic regression classifier evaluation for the 92 datasets. The overall average results for each dataset were as follows: accuracy, 84.9444%; precision, 0.0376%; recall, 92.47%; and F1-Score, 0.0751%. Overall, the accuracy was low. The percentage of malware app data not detected as malicious was 7.53%, whereas the percentage of benign app data not detected as benign was 15.06%. Because nearly all false alarms fell on the vastly larger benign class, the precision was far lower than the recall.
Figure 6 illustrates the confusion matrix for the benign and malware classes. The per-class accuracies ranged from 84.9439% (benign) to 92.4700% (malware); the malware class was predicted more accurately than the benign class.
The ROC curve for the logistic regression method is presented in Figure 7, with an AUROC value of 0.953914. Figure 8 illustrates the precision–recall curve, where the AUPRC was calculated as 0.014239.

4.1.2. Evaluation of Random Forest

Table 10 presents the evaluation results of the random forest classifier for the 92 datasets. The overall average values of the evaluation results for each dataset are as follows: accuracy, 99.9994%; precision, 98.7150%; recall, 90.6815%; and F1-Score, 94.5278%. In general, the accuracy is high. The percentage of malware app data not detected as malicious was 9.32%. Therefore, recall was evaluated to be noticeably lower than precision.
Figure 9 illustrates the confusion matrix for the benign and malware classes. The per-class accuracies ranged from 90.6815% (malware) to 99.9999% (benign); the benign class was predicted more accurately than the malware class.
The AUROC for the random forest method is plotted in Figure 10 and has a value of 0.999461. Figure 11 illustrates the precision–recall curve; the AUPRC was 0.982923.

4.1.3. Evaluation of LightGBM

Table 11 presents the evaluation results of the LightGBM classifier for each dataset. The overall average results for the datasets are as follows: accuracy, 99.9996%; precision, 96.4526%; recall, 97.2635%; and F1-Score, 96.8564%. Accuracy was evaluated to be high, and precision and recall were evaluated to be balanced without significant differences.
The per-class accuracies in the confusion matrix ranged from 97.2635% to 99.9998% (Figure 12); both the benign and malware classes were predicted with high accuracy.
The AUROC for the LightGBM method is plotted in Figure 13 and has a value of 0.999559. Figure 14 illustrates the precision–recall curve, and the AUPRC value was 0.987537.

4.1.4. Evaluation of XGBoost

Table 12 presents the evaluation results of the XGBoost classifier for each dataset. The overall average results for the datasets are as follows: accuracy, 99.9992%; precision, 91.9447%; recall, 95.1350%; and F1-Score, 93.5127%. The accuracy was evaluated as high, and the precision was slightly lower than the recall.
As illustrated in Figure 15, the per-class accuracies in the confusion matrix ranged from 95.1350% to 99.9995%; the benign class was predicted slightly more accurately than the malware class.
The AUROC for the XGBoost method is plotted in Figure 16 and has a value of 0.999145. Figure 17 presents the precision–recall curve, and the AUPRC value was 0.963108.

4.1.5. Evaluation of CatBoost

Table 13 presents the evaluation results of the CatBoost classifier for the 92 datasets. The overall average results for the datasets were as follows: accuracy, 99.9995%; precision, 92.9774%; recall, 98.7480%; and F1-Score, 95.7759%. The accuracy of this method was high. Moreover, although the recall was high, the precision was relatively lower.
The per-class accuracies in the confusion matrix ranged from 98.7480% to 99.9995% (Figure 18); both the benign and malware classes were predicted with high accuracy.
The AUROC for the CatBoost method is plotted in Figure 19 and has a value of 0.999988. Figure 20 illustrates the precision–recall curve, and the value of the AUPRC was 0.977388.

4.2. Selected Model and Decision Threshold

Table 14 presents a comparison of the evaluation results for each model. The main goal of this study is to accurately detect and classify benign and malware apps to ensure the effectiveness of the model in real-world operational environments. Upon evaluating various classifiers, LightGBM was selected as the most suitable model owing to its exceptional performance across key metrics. LightGBM achieved the highest accuracy (0.999996) and best F1-Score (0.968564), demonstrating an excellent balance between precision and recall, which is essential for handling imbalanced datasets. It also recorded the highest AUPRC (0.987537), highlighting its superior ability to differentiate between benign and malicious apps even in challenging scenarios. Compared with other models, such as random forest and CatBoost, LightGBM consistently balanced the precision, recall, and F1-Score more effectively. Logistic Regression, although computationally simple, struggles significantly with imbalanced data, making it impractical for this task. LightGBM was chosen for its outstanding detection performance and computational efficiency, making it ideal for deployment in resource-constrained or real-time environments. Its scalability and robust generalization across various evaluation metrics further solidified its position as the best choice, outperforming traditional models and advanced gradient-boosting models, such as XGBoost and CatBoost. In summary, LightGBM proved to be the most practical and effective solution for this application.
Additional experiments were conducted to further enhance the performance of LightGBM and determine the optimal decision threshold. The experimental results are presented in Table 15. Based on these results, we selected 0.5 as the optimal decision threshold. A fixed decision threshold (0.5) was ultimately selected for its ability to balance precision and recall, its simplicity of implementation, and its suitability for operational environments. Although alternative thresholds, such as 0.603 or dynamic values, offer potential benefits for specific use cases, their added complexity or limited incremental improvements make them less practical for general applications. The final results of our proposed malware detection model exhibited an accuracy of 99.9958%, a precision of 96.6507%, a recall of 97.4277%, and an F1-Score of 97.0376%. The three threshold strategies compared are defined below; a selection sketch follows the list.
  • Threshold 1: Threshold that always changes according to the model and date. For each test date, D, the model was fitted from D-70 to D-15. Subsequently, the threshold that produced a high F1-Score for the benign and malware samples from D-14 to D-1 was selected.
  • Threshold 2: The default threshold value was fixed at 0.5.
  • Threshold 3: The threshold was fixed at a single value of 0.603, which produces high F1-Scores for all 92 datasets measured ex post facto (for reference).
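The sketch below illustrates the three strategies. Threshold 1's per-date F1 search is mimicked with a grid over a validation window; y_val and p_val are placeholder arrays standing in for the D-14 to D-1 labels and scores.

```python
# Sketch: the three threshold strategies above. Threshold 1 searches a grid
# on a validation window (placeholder arrays y_val / p_val); the other two
# are fixed constants.
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that maximizes F1 on a validation window."""
    scores = [f1_score(y_true, y_prob >= t) for t in grid]
    return float(grid[int(np.argmax(scores))])

threshold_1 = best_f1_threshold(y_val, p_val)  # re-fit per test date D
threshold_2 = 0.5    # fixed default; the value selected in this study
threshold_3 = 0.603  # fixed ex post facto value (for reference)
```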

4.3. Comparative Analysis and Statistical Validation of Proposed Model

To validate the performance of our proposed financial fraud malware detection model, we conducted a comprehensive statistical comparison using five feature set variations (Variation1–Variation5) and the proposed model (Table 16). The analysis was performed across four key performance metrics: accuracy, recall, precision, and F1-Score. This section presents the statistical analysis and visualizations using a one-way Analysis of Variance (ANOVA) and box plots to demonstrate the superiority of our proposed model.
To assess the statistical significance of the performance differences across the six feature sets, we applied one-way ANOVA to each performance metric. The null hypothesis (H₀) assumes that the mean performance is the same across all feature sets, while the alternative hypothesis (H₁) assumes that at least one feature set performs significantly better than the others [64]. The ANOVA results were as follows (a computation sketch follows the list):
  • Accuracy: F-Statistic = 56.22, p-value = 3.91 × 10⁻⁴⁷
  • Recall: F-Statistic = 58.19, p-value = 1.60 × 10⁻⁴⁸
  • Precision: F-Statistic = 25.25, p-value = 6.16 × 10⁻²³
  • F1-Score: F-Statistic = 43.92, p-value = 4.42 × 10⁻³⁸
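This test reduces to a single scipy call per metric, as sketched below; the six score arrays (one value per daily dataset for each variation) are placeholders.

```python
# Sketch: one-way ANOVA over the 92 per-dataset scores of each feature-set
# variation, using scipy; the six score arrays are placeholders.
from scipy.stats import f_oneway

f_stat, p_value = f_oneway(scores_var1, scores_var2, scores_var3,
                           scores_var4, scores_var5, scores_proposed)
print(f"F-Statistic = {f_stat:.2f}, p-value = {p_value:.2e}")
```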
Figure 21 and Figure 22 present histograms of F-statistics and p-values for different metrics. As the p-values for all the metrics were significantly smaller than 0.05, we reject the null hypothesis. This confirms that statistically significant differences existed in the performance of the feature sets for all metrics. Table 17 and Figure 23 present the box plots generated to provide a visual representation of the variability and central tendencies of each feature set across the four performance metrics. The box plots illustrate that the proposed model consistently achieves higher median values and less variability than the other variations, indicating a more stable and superior performance across all datasets [71].
The ANOVA results, supported by visual insights from the box plots, demonstrate that the proposed model significantly outperforms the five variations across all performance metrics. The inclusion of additional features, such as user activity, user information, and app package name statistics, in the proposed model led to notable improvements in accuracy, recall, precision, and F1-Score. These enhancements render the proposed model a reliable solution for detecting fraudulent malware in real-world applications.

4.4. Feature Importance

This section discusses the evaluated contributions of each feature to the final model. To calculate the featurewise contributions, we used TreeExplainer (TreeSHAP), which provides variable importance for each prediction case [72]. Figure 24 presents the average contribution of each feature by date for the final LightGBM model. In the “User activity” and “User information” feature sets, age (age), elapsed time after installation (ist_af_psg_drtm), and app source (app_src) were deemed important. Most malware apps are installed through an abnormal method whose source cannot be verified rather than through a normal channel, such as the Play Store. Therefore, app source information is crucial in determining whether an app is malicious.
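A TreeSHAP analysis of this kind can be sketched in a few lines; the snippet below assumes the third-party shap package, the fitted LightGBM model from Section 3, and a placeholder feature frame X_test.

```python
# Sketch: per-case feature contributions with TreeSHAP, as used for Figure 24.
# Assumes the third-party `shap` package, the fitted `model`, and a
# placeholder feature frame `X_test`.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # contribution per feature/case
shap.summary_plot(shap_values, X_test)       # ranks features by mean |SHAP|
```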
In the “App Information” feature set, app size (app_sz) and app package name (app_pkg_nm) are important parameters. As malware apps are created for specific purposes such as stealing personal information, their app size is much smaller than that of benign apps. The raw data revealed that over 90% of the malware apps had a size of 20 MB or less. Moreover, the second and third words among the words in the app package name significantly influenced the inference of the malware apps. In the “App permissions” feature set, the information on request package installation permissions (request_install_packages) had the largest influence compared to information on other permissions.

4.5. Comparison with Related Works

In this section, the categories of datasets, detection methods, ML models, and performance metrics are discussed and compared with existing studies. The proposed framework outperformed existing approaches in several important aspects, including real-world data usage, scalability to real-time applications, and excellent ML performance (Table 18). Combining static analysis, undersampling, and LightGBM is more efficient and effective in detecting financial fraud malware than existing methods, providing a practical solution for large financial institutions.

5. Limitations and Future Directions

The limitations of the proposed work, along with potential future directions, are as follows:
  • The validity of the study may be influenced by dataset bias, as it relies on data from a single bank. Additionally, challenges related to imbalanced data, potential scalability issues in real-world deployment, and the need for continuous adaptation to evolving malware threats must be considered.
  • Since this model is designed for static analysis-based financial fraud malware detection, it has not been evaluated against advanced malware evasion techniques (e.g., code obfuscation, zero-day exploits). This limitation may result in an overestimation of its actual detection effectiveness against advanced persistent threats (APTs).
  • Future research will focus on enhancing the model’s generalizability by incorporating datasets from multiple banks and financial institutions across different countries.
  • Since random undersampling may lead to the loss of critical malicious patterns, potentially affecting detection performance, future work will explore hybrid resampling methods, such as SMOTE, cost-sensitive learning, and ensemble-based strategies, to assess their impact on performance and robustness.
  • Explainable AI (XAI) techniques [73] will be considered to improve the interpretability of ML-based malware detection pipelines. Additionally, incremental learning [74], which is effective in detecting emerging malware, will be explored.
  • Further research will investigate the application of natural language processing techniques on app names to enhance financial fraud malware detection models.

6. Conclusions

This study proposed a new framework for detecting malware apps used to commit financial fraud on Android systems. We utilized the datasets of benign and malware apps analyzed by the FDS of Bank A in South Korea from April 2023 to September 2023. As the data on benign and malware apps were unbalanced, undersampling was performed to adjust the proportion of benign apps in the training dataset. Moreover, 92 datasets were constructed through daily training to select the optimal model. To test the proposed approach, we used five ML algorithms: logistic regression, random forest, LightGBM, XGBoost, and CatBoost. Upon calculating the optimal hyperparameter values for each algorithm using a grid search, the models were evaluated using the following common evaluation metrics: accuracy, precision, recall, F1-score, AUROC, and AUPRC. We selected the LightGBM model because it achieved the best performance, with an accuracy of 99.99% and an F1-Score of 96.86%. Upon selecting 0.5 as the optimal decision threshold to determine whether an app was malicious, the LightGBM model was re-evaluated and yielded an accuracy of 99.99% and an F1-Score of 97.04%. Additionally, the importance of the features in the final model was evaluated. Age, app source, app size, elapsed time after app installation, the package installation request permission, and the statistics of the second and third words of the package name were the most influential variables. The major achievement of this study is a malware detection model usable in real operational environments, built by adding the derived feature sets “User activity”, “User information”, and “App package name statistics” to the “App information”, “App permissions”, and “App service” feature sets commonly used in existing studies on static analysis-based malware detection. The proposed model will be highly effective in detecting financial fraud malware if applied to an FDS that detects abnormal transactions in the financial sector or financial apps.

Author Contributions

Conceptualization, J.S.; methodology, J.S.; validation, D.K.; formal analysis, J.S.; investigation, D.K.; data curation, D.K.; writing—original draft preparation, J.S.; writing—review and editing, J.S.; visualization, D.K.; supervision, K.L.; project administration, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The dataset used in this study cannot be publicly shared due to security policies and confidentiality restrictions imposed by Bank A.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANOVA: Analysis of variance
APK: Android package kit
AUPRC: Area under the precision–recall curve
AUROC: Area under the ROC curve
BiLSTM: Bidirectional long short-term memory
CNN: Convolutional neural network
DAN: Discriminative adversarial network
FDS: Fraud detection system
FN: False negative
FP: False positive
GBDT: Gradient boosting decision tree
GPU: Graphics processing unit
GRU: Gated recurrent unit
IBK: Instance-based k
KNN: k-nearest neighbor
LDC: Linear discriminant classification
LightGBM: Light gradient-boosting machine
LSTM: Long short-term memory
ML: Machine learning
MLP: Multi-layer perceptron
PARZC: Parzen classifier
RBF: Radial basis function
ROC: Receiver operating characteristic
SVM: Support vector machine
TAN: Tree-augmented naïve Bayes
TN: True negative

References

  1. Karunanayake, N.; Rajasegaran, J.; Gunathillake, A.; Seneviratne, S.; Jourjon, G. A multi-modal neural embeddings approach for detecting mobile counterfeit apps: A case study on Google Play Store. IEEE Trans. Mob. Comput. 2022, 21, 16–30.
  2. Statista. Google Play Store: Number of Apps 2024. 2024. Available online: https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/ (accessed on 31 December 2024).
  3. IDC Research. Apple Grabs the Top Spot in the Smartphone Market in 2023 Along with Record High Market Share Despite the Overall Market Dropping 3.2%, According to IDC Tracker. 2024. Available online: https://www.idc.com/getdoc.jsp?containerId=prUS51776424 (accessed on 31 December 2024).
  4. Statista. Mobile OS Market Share Worldwide 2009–2024. 2024. Available online: https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/ (accessed on 31 December 2024).
  5. Arora, A.; Peddoju, S.K.; Conti, M. PermPair: Android malware detection using permission pairs. IEEE Trans. Inf. Forensics Secur. 2020, 15, 1968–1982.
  6. Zhu, H.; Gu, W.; Wang, L.; Xu, Z.; Sheng, V.S. Android malware detection based on multi-head squeeze-and-excitation residual network. Expert Syst. Appl. 2023, 212, 118705.
  7. Financial IT. 4 Banking Malware Types Detected on Users’ Devices in 2023. 2023. Available online: https://financialit.net/news/banking/4-banking-malware-types-detected-users-devices-2023 (accessed on 31 December 2024).
  8. Kyung-don, N. [Graphic News] Damages from Phishing Scams Jump over 35%. Korea Herald. 2024. Available online: https://www.koreaherald.com/article/3361908 (accessed on 31 December 2024).
  9. Allix, K.; Bissyandé, T.F.; Jérome, Q.; Klein, J.; State, R.; Le Traon, Y. Empirical assessment of machine learning-based malware detectors for Android. Empir. Softw. Eng. 2016, 21, 183–211.
  10. Odat, E.; Yaseen, Q.M. A novel machine learning approach for Android malware detection based on the co-existence of features. IEEE Access 2023, 11, 15471–15484.
  11. Talha, K.A.; Alper, D.I.; Aydin, C. APK Auditor: Permission-based Android malware detection system. Digit. Investig. 2015, 13, 1–14.
  12. Enck, W.; Gilbert, P.; Han, S.; Tendulkar, V.; Chun, B.-G.; Cox, L.P.; Jung, J.; McDaniel, P.; Sheth, A.N. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst. 2014, 32, 1–29.
  13. Lindorfer, M.; Neugschwandtner, M.; Weichselbaum, L.; Fratantonio, Y.; van der Veen, V.; Platzer, C. ANDRUBIS—1,000,000 apps later: A view on current Android malware behaviors. In Proceedings of the Third International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Wroclaw, Poland, 11 September 2014; pp. 3–17.
  14. Bayazit, E.C.; Koray Sahingoz, O.; Dogan, B. Malware detection in Android systems with traditional machine learning models: A survey. In Proceedings of the 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 26–28 June 2020; pp. 1–8.
  15. Google Play Protect. Malware Categories. Available online: https://developers.google.com/android/play-protect/phacategories (accessed on 31 December 2024).
  16. Beroual, A.; Al-Shaikhli, I.F. A survey on Android malwares and defense techniques. J. Comput. Theor. Nanosci. 2020, 17, 1557–1565.
  17. EyalSalman, R.T. Android stalkerware detection techniques: A survey study. In Proceedings of the 2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 22–24 May 2023; pp. 270–275.
  18. Play Console Help. Hostile Downloaders. Available online: https://support.google.com/googleplay/android-developer/answer/11189134?hl=en# (accessed on 31 December 2024).
  19. Aonzo, S.; Merlo, A.; Tavella, G.; Fratantonio, Y. Phishing attacks on modern Android. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1788–1801.
  20. Kaspersky. Android Spyware Detection & Removal. Available online: https://www.kaspersky.com/resource-center/preemptive-safety/spyware-on-android (accessed on 31 December 2024).
  21. Tao, G.; Zheng, Z.; Guo, Z.; Lyu, M.R. MalPat: Mining patterns of malicious and benign Android apps via permission-related APIs. IEEE Trans. Reliab. 2018, 67, 355–369.
  22. Google. Google Play Store. Available online: https://play.google.com/store/games (accessed on 31 December 2024).
  23. VirusShare.com. Available online: https://virusshare.com/ (accessed on 31 December 2024).
  24. Parkour, M. Contagio Mini-Dump. Available online: http://contagiominidump.blogspot.com/ (accessed on 31 December 2024).
  25. Tiwari, S.R.; Shukla, R.U. An Android malware detection technique based on optimized permissions and API. In Proceedings of the 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 11–12 July 2018; pp. 258–263.
  26. AndroidPRAGuardDataset. Available online: https://sites.unica.it/pralab/en/AndroidPRAGuardDataset (accessed on 31 December 2024).
  27. Zhang, Y.; Yang, Y.; Wang, X. A novel Android malware detection approach based on convolutional neural network. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, Guiyang, China, 16–19 March 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 144–149. [Google Scholar] [CrossRef]
  28. Arp, D.; Spreitzenbarth, M.; Hübner, M.; Gascon, H.; Rieck, K. Drebin: Effective and explainable detection of Android malware in your pocket. In Proceedings of the 2014 Network and Distributed System Security Symposium; Internet Society, San Diego, CA, USA, 23–26 February 2014; Available online: https://cir.nii.ac.jp/crid/1363670320772385920 (accessed on 31 December 2024).
  29. Baldini, G.; Geneiatakis, D. A performance evaluation on distance measures in KNN for mobile malware detection. In Proceedings of the 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 23–26 April 2019; pp. 193–198. [Google Scholar] [CrossRef]
  30. Amin, M.; Tanveer, T.A.; Tehseen, M.; Khan, M.; Khan, F.A.; Anwar, S. Static malware detection and attribution in Android byte-code through an end-to-end deep system. Future Gener. Comput. Syst. 2020, 102, 112–126. [Google Scholar] [CrossRef]
  31. Rijin, F. Malware Dataset [Dataset]. 2017. Available online: https://www.kaggle.com/datasets/blackarcher/malware-dataset (accessed on 31 December 2024).
  32. Millar, S.; McLaughlin, N.; Martinez del Rincon, J.; Miller, P. Multi-view deep learning for zero-day Android malware detection. J. Inf. Secur. Appl. 2021, 58, 102718. [Google Scholar] [CrossRef]
  33. İbrahim, M.; Issa, B.; Jasser, M.B. A method for automatic Android malware detection based on static analysis and deep learning. IEEE Access 2022, 10, 117334–117352. [Google Scholar] [CrossRef]
  34. VirusTotal. Available online: https://www.virustotal.com/gui/home/upload (accessed on 31 December 2024).
  35. Bayazit, E.C.; Sahingoz, O.K.; Dogan, B. A deep learning based Android malware detection system with static analysis. In Proceedings of the 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 9–11 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
  36. Investigation on Android Malware [Dataset], Datasets|Research|Canadian Institute for Cybersecurity|UNB. 2019. Available online: https://www.kaggle.com/datasets/malikbaqi12/cic-invesandmal2019-dataset (accessed on 31 December 2024).
  37. Akbar, F.; Hussain, M.; Mumtaz, R.; Riaz, Q.; Wahab, A.W.A.; Jung, K.-H. Permissions-based detection of Android malware using machine learning. Symmetry 2022, 14, 718. [Google Scholar] [CrossRef]
  38. Afonso, V.M.; de Amorim, M.F.; Grégio, A.R.A.; Junquera, G.B.; de Geus, P.L. Identifying Android malware using dynamically obtained features. J. Comput. Virol. Hacking Tech. 2015, 11, 9–17. [Google Scholar] [CrossRef]
  39. MalGenome Project, Yajins Homepage. Available online: http://www.malgenomeproject.org/ (accessed on 31 December 2024).
  40. Dash, S.K.; Suarez-Tangil, G.; Khan, S.; Tam, K.; Ahmadi, M.; Kinder, J.; Cavallaro, L. DroidScribe: Classifying Android malware based on runtime behavior. In Proceedings of the 2016 IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 22–26 May 2016; pp. 252–261. [Google Scholar] [CrossRef]
  41. Cai, H.; Meng, N.; Ryder, N.; Yao, D. DroidCat: Effective Android malware detection and categorization via app-level profiling. IEEE Trans. Inf. Forensics Secur. 2019, 14, 1455–1470. [Google Scholar] [CrossRef]
  42. Allix, K.; Bissyandé, T.F.; Klein, J.; Traon, Y.L. AndroZoo: Collecting millions of Android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, Austin, TX, USA, 14–22 May 2016; ACM: New York, NY, USA, 2016; pp. 468–471. [Google Scholar] [CrossRef]
  43. Shakya, S.; Dave, M. Analysis, detection, and classification of Android malware using system calls. arXiv 2022, arXiv:2208.06130. [Google Scholar] [CrossRef]
  44. Lashkari, A.H.; Kadir, A.F.; Taheri, L.; Ghorbani, A.A. Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification. In Proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, QC, Canada, 22–25 October 2018. [Google Scholar]
  45. Hashem El Fiky, A.; Madkour, M.A.; El Shenawy, A. Android malware category and family identification using parallel machine learning. J. Inf. Technol. Manag. 2022, 14, 19–39. [Google Scholar] [CrossRef]
  46. Rahali, A.; Lashkari, A.H.; Kaur, G.; Taheri, L.; Gagnon, F.; Massicotte, F. DIDroid: Android Malware Classification and Characterization Using Deep Image Learning. In Proceedings of the 10th International Conference on Communication and Network Security (ICCNS2020), Tokyo, Japan, 27–29 November 2020; pp. 70–82. [Google Scholar]
  47. Chen, S.; Xue, M.; Tang, Z.; Xu, L.; Zhu, H. StormDroid: A streaminglized machine learning-based system for detecting Android malware. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 30 May–3 June 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 377–388. [Google Scholar] [CrossRef]
  48. Saracino, A.; Sgandurra, D.; Dini, G.; Martinelli, F. MADAM: Effective and efficient behavior-based Android malware detection and prevention. IEEE Trans. Dependable Secure Comput. 2018, 15, 83–97. [Google Scholar] [CrossRef]
  49. Kouliaridis, V.; Kambourakis, G.; Geneiatakis, D.; Potha, N. Two anatomists are better than one—Dual-level Android malware detection. Symmetry 2020, 12, 1128. [Google Scholar] [CrossRef]
  50. Hadiprakoso, R.B.; Kabetta, H.; Buana, I.K.S. Hybrid-based malware analysis for effective and efficiency Android malware detection. In Proceedings of the 2020 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 19–20 November 2020; pp. 8–12. [Google Scholar] [CrossRef]
  51. Mahdavifar, S.; Kadir, A.F.A.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning. In Proceedings of the 18th IEEE International Conference on Dependable, Autonomic, and Secure Computing (DASC), Calgary, AB, Canada, 17–24 August 2020. [Google Scholar]
  52. Surendran, R.; Thomas, T.; Emmanuel, S. A TAN based hybrid model for Android malware detection. J. Inf. Secur. Appl. 2020, 54, 102483. [Google Scholar] [CrossRef]
  53. sk3ptre. sk3ptre/AndroidMalware_2019. 2024. Available online: https://github.com/sk3ptre/AndroidMalware_2019 (accessed on 31 December 2024).
  54. Yuan, Z.; Lu, Y.; Xue, Y. Droiddetector: Android malware characterization and detection using deep learning. Tsinghua Sci. Technol. 2016, 21, 114–123. [Google Scholar] [CrossRef]
  55. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  56. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
  57. Provost, F.J.; Fawcett, T.; Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998; pp. 445–453. [Google Scholar]
  58. Google Cloud Platform Console Help, Unverified Apps. Available online: https://support.google.com/cloud/answer/7454865?hl=en (accessed on 31 December 2024).
  59. Zhou, Y.; Jiang, X. Dissecting Android malware: Characterization and evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 95–109. [Google Scholar] [CrossRef]
  60. Felt, A.P.; Greenwood, K.; Wagner, D. The effectiveness of application permissions. In Proceedings of the 2nd USENIX Conference on Web Application Development, Portland, OR, USA, 15–16 June 2011; Available online: https://www.usenix.org/events/webapps11/tech/final_files/Felt.pdf (accessed on 31 December 2024).
  61. Shabtai, A.; Fledel, Y.; Elovici, Y. Securing Android-powered mobile devices using SELinux. IEEE Secur. Priv. 2010, 8, 36–44. [Google Scholar] [CrossRef]
  62. Android Developers, Services Overview|Background Work. Available online: https://developer.android.com/develop/background-work/services (accessed on 31 December 2024).
  63. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 243–248. [Google Scholar] [CrossRef]
  64. Vijayvargiya, S.; Kumar, L.; Murthy, L.; Misra, S.; Krishna, A.; Padmanabhuni, S. Empirical analysis for investigating the effect of machine learning techniques on malware prediction. In Proceedings of the 18th International Conference on Evaluation of Novel Approaches to Software Engineering ENASE, Lisbon, Portugal, 24–25 April 2023; SciTePress: Prague, Czech Republic, 2023; pp. 453–460. [Google Scholar] [CrossRef]
  65. Statistics Solutions, What Is Logistic Regression? Available online: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-logistic-regression/ (accessed on 31 December 2024).
  66. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  67. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Advances in Neural Information Processing, Systems. Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf (accessed on 30 December 2017).
  68. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  69. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  70. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363. [Google Scholar] [CrossRef]
  71. Kumar, L.; Hota, C.; Mahindru, A.; Neti, L.B.M. Android malware prediction using extreme learning machine with different kernel functions. In Proceedings of the 15th Asian Internet Engineering Conference, Phuket, Thailand, 7–9 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 33–40. [Google Scholar] [CrossRef]
  72. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. Explainable AI for trees: From local explanations to global understanding. arXiv 2019, arXiv:1905.04610. [Google Scholar] [CrossRef]
  73. Nascita, A.; Aceto, G.; Ciuonzo, D.; Montieri, A.; Persico, V.; Pescapé, A. A survey on explainable artificial intelligence for Internet traffic classification and prediction, and intrusion detection. IEEE Commun. Surv. Tutor. 2024. [Google Scholar] [CrossRef]
  74. Xu, X.; Zhang, X.; Zhang, Q.; Wang, Y.; Adebisi, B.; Ohtsuki, T.; Sari, H.; Gui, G. Advancing malware detection in network traffic with self-paced class incremental learning. IEEE Internet Things J. 2024, 11, 21816–21826. [Google Scholar] [CrossRef]
Figure 1. Malware categories.
Figure 2. Suggested framework for detecting malware.
Figure 3. Raw data collection process.
Figure 4. Configuring experimental datasets.
Figure 5. Construction of the proposed model.
Figure 6. Confusion matrix of the Logistic Regression classifier with two target classes: Malware and Benign.
Figure 7. ROC–AUC curve obtained using Logistic Regression.
Figure 8. Precision–Recall curve obtained using Logistic Regression.
Figure 9. Confusion matrix of the Random Forest classifier with two target classes.
Figure 10. ROC–AUC curve obtained using Random Forest.
Figure 11. Precision–Recall curve obtained using Random Forest.
Figure 12. Confusion matrix of the LightGBM classifier with two target classes.
Figure 13. ROC–AUC curve obtained using LightGBM.
Figure 14. Precision–Recall curve obtained using LightGBM.
Figure 15. Confusion matrix of the XGBoost classifier with two target classes.
Figure 16. ROC–AUC curve obtained using XGBoost.
Figure 17. Precision–Recall curve obtained using XGBoost.
Figure 18. Confusion matrix of the CatBoost classifier with two target classes.
Figure 19. ROC–AUC curve obtained using CatBoost.
Figure 20. Histogram of F-statistics for different metrics.
Figure 21. Histogram of F-statistics for different metrics.
Figure 22. Histogram of p-values for different metrics.
Figure 23. Boxplots of evaluation metrics for feature selection techniques.
Figure 24. Datewise average feature importance for the proposed model.
Table 1. Comparison of some related studies that utilized static analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Tao et al. [21] | 2017 | API calls | Google Play [22], VirusShare [23], and Contagio [24] | DT | F1-score: 98.24% |
| Tiwari et al. [25] | 2018 | Permissions, API calls | PRAGuard [26] and Google Play | Logistic regression | Accuracy: 97.25%; Accuracy: 95.87% |
| Zhang et al. [27] | 2018 | Permissions, Intent filters, API calls, Constant strings | Drebin [28] and Chinese app markets | CNN | Accuracy: 97.40% |
| Baldini et al. [29] | 2019 | Permissions, API calls, Components, Network addresses | Drebin [28] | KNN | Accuracy: 99.48% |
| Amin et al. [30] | 2020 | Opcodes | AMD [31], Drebin [28], and VirusShare [23] | BiLSTM, LSTM, CNN, and DBN | Accuracy: 99.90%; F1-score: 99.60% |
| Millar et al. [32] | 2021 | Raw opcodes, Permissions, and API calls | Drebin [28] | DAN | F1-score: 97.30% |
| İbrahim et al. [33] | 2022 | Permissions, Services, API calls, Broadcast receivers, File size, Fuzzy hash, Opcode sequence | VirusTotal [34], AMD [31], MalDozer, and Contagio Security Blog | Functional API-based deep learning | F1-score: 99.5%; F1-score: 97% |
| Bayazit et al. [35] | 2022 | Permissions, Intents | CICInvesAndMal2019 [36] | RNN-based LSTM, BiLSTM, and GRU | Accuracy: 98.85%; F1-score: 98.21% |
| Akbar et al. [37] | 2022 | Permissions | VirusShare [23] | Random forest, SVM, rotation forest, and naïve Bayes | Accuracy: greater than or equal to 89% |
| Proposed method | | App information, App permissions, App service, User activity, User information, App package name statistics | Private dataset | LightGBM | Accuracy: 99.99%; F1-score: 97.04% |
Table 2. Comparison of related studies that utilized dynamic analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Afonso et al. [38] | 2015 | API calls and System calls | MalGenome [39] and VirusShare [23] | RF, J.48, Simple Logistic, NB, SMO, BayesNet, and IBK | Accuracy: 96.66% |
| Dash et al. [40] | 2016 | Binder communication, System calls | Drebin [28] | SVM | Accuracy: 94.00% |
| Cai et al. [41] | 2019 | Method calls, Inter-component communication (ICC), Intents | Drebin [28], MalGenome [39], VirusShare [23], and AndroZoo [42] | DroidCat | Accuracy: 97.00% |
| Shakya et al. [43] | 2022 | System calls | CIC-ANDMAL2017 [44] | KNN and Decision tree | F1-score: 85%; F1-score: 72% |
| El Fiky et al. [45] | 2022 | Memory features, APIs, Network, Battery, Logcat, Processes | CCCS-CIC-AndMal (2020) [46] | J48, KNN, SVM, and random forest | Accuracy: 96.89%; Accuracy: 99.65% |
Table 3. Comparison of related studies that utilized hybrid analysis.

| Studies | Year | Feature(s) | Dataset(s) | Algorithm(s) | Performance Evaluation |
|---|---|---|---|---|---|
| Chen et al. [47] | 2016 | Permissions, Sensitive API calls, Dynamic behavior sequences | Google Play [22] and Contagio [24] | StormDroid (SVM, MLP, C4.5, IBK, NB, and Bagging predictor) | Accuracy: 93.80% |
| Saracino et al. [48] | 2018 | System calls, SMS, Critical APIs, User activity, App metadata | MalGenome [39], Contagio [24], and VirusShare [23] | K-NN, QDC, LDC, PARZC, MLP, and RBF | Accuracy: 96.90% |
| Kouliaridis et al. [49] | 2020 | API calls, Permissions, Intents, Network traffic, Java classes, Inter-process communication | Drebin [28], VirusShare [23], and AndroZoo [42] | LR, naïve Bayes, RF, KNN, SGC, AdaBoost, SVM, and Ensemble | Drebin: 100%; VirusShare: 100%; AndroZoo: 91.8% |
| Hadiprakoso et al. [50] | 2020 | | MalGenome [39], Drebin [28], and CICMalDroid 2020 [51] | Gradient boost (GB) | Accuracy: 99.36% |
| Surendran et al. [52] | 2020 | API calls, Permissions, System calls | Drebin [28], AMD [31], AndroZoo [42], GitHub [53], and Google Play | Tree-augmented naïve Bayes (TAN) | Accuracy: 97.00% |
Table 4. Variable information of the collected raw data.

| Category | Variable Name | Description | Other Information |
|---|---|---|---|
| Target | lbl_cd | Label | "Benign" or "Malware" |
| Date | trsc_dt | Transaction date | Login date |
| App information | app_nm | App name | |
| | app_pkg_nm | App package name | |
| | app_sz | App size | |
| | app_ist_dt | App installation date | |
| App permissions | read_sms | SMS read permission | "Y" or "N" |
| | receive_sms | SMS receive permission | "Y" or "N" |
| | process_outgoing_calls | Permission to process outgoing calls | "Y" or "N" |
| | call_phone | Permission to make calls | "Y" or "N" |
| | request_install_packages | Permission to request the installation of packages | "Y" or "N" |
| | manage_external_storage | Permission to manage shared storage | "Y" or "N" |
| App service | basc_phone_app | Whether it is the default phone app | "Y" or "N" |
| User activity | app_src | App source | "Source Verification" or "Source Unverified" |
Table 5. Information on selected feature sets—Part 1.

| Feature Set | Feature Name | Description | Type |
|---|---|---|---|
| Target | lbl_cd | Label | Categorical |
| Date | trsc_dt | Transaction date | Date |
| App information | app_pkg_nm | App package name | Categorical |
| | app_sz | App size | Numerical |
| | app_sz_yn | Whether the app size is Null | Categorical |
| App permissions | read_sms | SMS read permission | Categorical |
| | receive_sms | SMS receive permission | Categorical |
| | process_outgoing_calls | Permission to process outgoing calls | Categorical |
| | call_phone | Permission to make calls | Categorical |
| | request_install_packages | Permission to request the installation of packages | Categorical |
| | manage_external_storage | Permission to manage shared storage | Categorical |
| App service | basc_phone_app | Whether it is the default phone app | Categorical |
| User activity | app_src | App source | Categorical |
| | app_src_cnfm_yn | Whether the app source is verified | Categorical |
| | Ist_af_psg_drtm | The elapsed time after the installation (installation date to transaction date) | Numerical |
| | Ist_af_pst_drtm_yn | Whether the elapsed time after the installation is Null | Categorical |
| User information | age | Age | Numerical |
| | age_yn | Whether the age is Null | Categorical |
| | gndr_dv_cd | Gender | Categorical |
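As a small illustration of how the Table 4 and Table 5 variables could be turned into model inputs, the following sketch derives a null-indicator feature and binarizes the "Y"/"N" flags with pandas. The frame contents are invented examples; the study's actual preprocessing pipeline is not published.

```python
import pandas as pd

# Invented example rows; real data come from Bank A's FDS logs.
raw = pd.DataFrame({
    "app_sz": [10_485_760, None],
    "read_sms": ["Y", "N"],
    "request_install_packages": ["N", "Y"],
})

features = raw.copy()
# Derive the null indicator mirroring app_sz_yn in Table 5.
features["app_sz_yn"] = raw["app_sz"].isna().map({True: "Y", False: "N"})
# Binarize the "Y"/"N" categorical flags for the classifiers.
for col in ["read_sms", "request_install_packages", "app_sz_yn"]:
    features[col] = features[col].map({"Y": 1, "N": 0})
print(features)
```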
Table 6. Information on selected feature sets—Part 2.

| Feature Set | Feature Name | Description | Type |
|---|---|---|---|
| App package name statistics | nm_len | Length | Numerical |
| | word_len | Number of words | Numerical |
| | word_list_len_std | The standard deviation of the length of individual words in the word list | Numerical |
| | word_list_len_mean | The mean of the length of individual words in the word list | Numerical |
| | num_ratio | The ratio of numbers | Numerical |
| | vowel_ratio | The ratio of vowels | Numerical |
| | consonant_ratio | The ratio of consonants | Numerical |
| | consecutive_num_ratio | MAX (the ratio of consecutive numbers) | Numerical |
| | consecutive_vowel_ratio | MAX (the ratio of consecutive vowels) | Numerical |
| | consecutive_consonant_ratio | MAX (the ratio of consecutive consonants) | Numerical |
| | word_1_len | The length of the first word | Numerical |
| | word_1_num_ratio | The ratio of numbers in the first word | Numerical |
| | word_1_vowel_ratio | The ratio of vowels in the first word | Numerical |
| | word_1_consonant_ratio | The ratio of consonants in the first word | Numerical |
| | word_1_consecutive_num_ratio | MAX (the ratio of consecutive numbers in the first word) | Numerical |
| | word_1_consecutive_vowel_ratio | MAX (the ratio of consecutive vowels in the first word) | Numerical |
| | word_1_consecutive_consonant_ratio | MAX (the ratio of consecutive consonants in the first word) | Numerical |
| | word_6_consecutive_num_ratio | MAX (the ratio of consecutive numbers in the sixth word) | Numerical |
| | word_6_consecutive_vowel_ratio | MAX (the ratio of consecutive vowels in the sixth word) | Numerical |
| | word_6_consecutive_consonant_ratio | MAX (the ratio of consecutive consonants in the sixth word) | Numerical |
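The package-name statistics above are plain string measurements, so they are straightforward to reproduce. The sketch below re-implements a handful of them under assumed definitions (words split on "."; ratios taken over the non-separator characters); the authors' exact formulas are not given, so treat this as illustrative.

```python
import re

def package_name_stats(pkg: str) -> dict:
    """Compute a few Table 6-style statistics (assumed definitions)."""
    words = pkg.split(".")                # word list of the package name
    lengths = [len(w) for w in words]
    chars = pkg.replace(".", "")          # characters without separators
    digits = sum(c.isdigit() for c in chars)
    vowels = sum(c in "aeiou" for c in chars.lower())

    def max_run_ratio(pattern: str) -> float:
        # MAX(ratio of consecutive matches), e.g., longest digit run / length.
        runs = [len(m.group()) for m in re.finditer(pattern, chars.lower())]
        return max(runs, default=0) / len(chars) if chars else 0.0

    return {
        "nm_len": len(pkg),
        "word_len": len(words),
        "word_list_len_mean": sum(lengths) / len(lengths),
        "num_ratio": digits / len(chars),
        "vowel_ratio": vowels / len(chars),
        "consecutive_num_ratio": max_run_ratio(r"[0-9]+"),
        "consecutive_vowel_ratio": max_run_ratio(r"[aeiou]+"),
        "word_1_len": lengths[0],
    }

print(package_name_stats("com.a1b2.fakebankapp"))
```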
Table 7. Number of benign and malware apps by dataset.

| Model Set (Daily) | Train Set: Benign | Train Set: Malware | Test Set: Benign | Test Set: Malware |
|---|---|---|---|---|
| Dataset 1 | 66,027,370 | 6397 | 841,131 | 18 |
| Dataset 2 | 66,238,910 | 6415 | 607,316 | 24 |
| Dataset 3 | 66,287,400 | 6439 | 1,180,391 | 81 |
| … | … | … | … | … |
| Dataset 90 | 71,891,560 | 11,975 | 689,702 | 7 |
| Dataset 91 | 71,499,660 | 11,982 | 534,351 | 4 |
| Dataset 92 | 70,938,520 | 11,986 | 605,394 | 2 |
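A minimal sketch of the undersampling step used to build each daily training set is shown below. The exact benign-to-malware sampling ratio used by the authors is not stated, so `benign_per_malware` is an illustrative parameter, and the toy frame stands in for one day's FDS log.

```python
import pandas as pd

def build_train_set(df: pd.DataFrame, benign_per_malware: int = 50,
                    seed: int = 42) -> pd.DataFrame:
    """Undersample benign rows to reduce the class imbalance (assumed policy)."""
    malware = df[df["lbl_cd"] == "Malware"]
    benign = df[df["lbl_cd"] == "Benign"]
    n_benign = min(len(benign), benign_per_malware * len(malware))
    sampled = benign.sample(n=n_benign, random_state=seed)
    # Shuffle the combined set so the classes are interleaved.
    return pd.concat([malware, sampled]).sample(frac=1, random_state=seed)

# Toy stand-in for one day's labeled transactions.
daily = pd.DataFrame({"lbl_cd": ["Benign"] * 1000 + ["Malware"] * 5})
print(build_train_set(daily)["lbl_cd"].value_counts())
```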
Table 8. Calculated optimal hyperparameter values for different algorithms.

| Algorithm | Optimal Hyperparameter Values |
|---|---|
| Logistic Regression | {"max_iter": 100, "class_weight": "balanced", "fit_intercept": True, "C" (regularization strength): 1.0, "penalty": "l2"} |
| Random Forest | {"n_estimators": 200, "class_weight": "balanced", "max_depth": 64, "bootstrap": False} |
| LightGBM | {"n_estimators": 300, "learning_rate": 0.1, "max_depth": -1, "class_weight": "balanced", "max_delta_step": 100, "num_leaves": 128, "colsample_bytree": 0.5, "subsample": 1, "objective": "binary", "boost_from_average": False} |
| XGBoost | {"n_estimators": 200, "learning_rate": 0.1, "scale_pos_weight": num_neg_samples/num_pos_samples, "max_depth": 0} |
| CatBoost | {"iterations": 200, "learning_rate": 0.03, "subsample": 0.8, "scale_pos_weight": num_benign_samples/num_malware_samples, "depth": 16} |
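The sketch below shows what the grid-search step could look like for LightGBM with scikit-learn's GridSearchCV, using a small grid around the Table 8 values. The authors' full search space is not reported, and the synthetic imbalanced data merely makes the snippet runnable.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic, highly imbalanced stand-in data (~2% positives).
X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)

# A small illustrative grid around the optimal LightGBM values in Table 8.
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1],
    "num_leaves": [64, 128],
}
base = LGBMClassifier(objective="binary", class_weight="balanced",
                      colsample_bytree=0.5, subsample=1.0, max_depth=-1)
search = GridSearchCV(base, param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```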
Table 9. Results of evaluation metrics using the Logistic Regression classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.856419 | 0.000149 | 1.000000 | 0.000298 |
| Dataset 2 | 0.860246 | 0.000271 | 0.958333 | 0.000542 |
| Dataset 3 | 0.853666 | 0.000411 | 0.876543 | 0.000821 |
| … | … | … | … | … |
| Dataset 91 | 0.860366 | 0.000073 | 1.000000 | 0.000145 |
| Dataset 92 | 0.878249 | 0.000061 | 1.000000 | 0.000123 |
| Total_Avg | 0.849444 | 0.000376 | 0.924700 | 0.000751 |
Table 10. Results of evaluation metrics using the Random Forest classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999999 | 1.000000 | 0.944444 | 0.971429 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 0.999997 | 1.000000 | 0.950617 | 0.974684 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999994 | 0.987150 | 0.906815 | 0.945278 |
Table 11. Results of evaluation metrics using the LightGBM classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999998 | 0.900000 | 1.000000 | 0.947368 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 1.000000 | 1.000000 | 0.978261 | 0.989011 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999996 | 0.964526 | 0.972635 | 0.968564 |
Table 12. Results of evaluation metrics for the XGBoost classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999998 | 0.900000 | 1.000000 | 0.947368 |
| Dataset 2 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 3 | 0.999996 | 0.941860 | 1.000000 | 0.970060 |
| … | … | … | … | … |
| Dataset 91 | 0.999998 | 0.800000 | 1.000000 | 0.888889 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999992 | 0.919447 | 0.951350 | 0.935127 |
Table 13. Results of evaluation metrics for the CatBoost classifier on 92 datasets.

| Model Dataset (Daily) | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Dataset 1 | 0.999995 | 0.818182 | 1.000000 | 0.900000 |
| Dataset 2 | 0.999993 | 0.857143 | 1.000000 | 0.923077 |
| Dataset 3 | 0.999994 | 0.920455 | 1.000000 | 0.958580 |
| … | … | … | … | … |
| Dataset 91 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Dataset 92 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Total_Avg | 0.999995 | 0.929774 | 0.987480 | 0.957759 |
Table 14. Overall comparison of the tested classifiers with two target classes.

| Model | Accuracy | Precision | Recall | F1-Score | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.849444 | 0.000376 | 0.924700 | 0.000751 | 0.953914 | 0.014239 |
| Random Forest | 0.999994 | 0.987150 | 0.906815 | 0.945278 | 0.999461 | 0.982923 |
| LightGBM | 0.999996 | 0.964526 | 0.972635 | 0.968564 | 0.999559 | 0.987537 |
| XGBoost | 0.999992 | 0.919447 | 0.951350 | 0.935127 | 0.999145 | 0.963108 |
| CatBoost | 0.999995 | 0.929774 | 0.987480 | 0.957759 | 0.999988 | 0.977388 |
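All six columns of Table 14 can be computed with standard scikit-learn metrics; the sketch below does so on toy labels and scores, using average precision as the usual estimate of AUPRC.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

# Toy labels (1 = malware) and model probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.02, 0.10, 0.48, 0.55, 0.70, 0.95]
y_pred = [int(p >= 0.5) for p in y_prob]   # default 0.5 decision threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
# AUROC and AUPRC are threshold-free and use the raw probabilities.
print("AUROC    :", roc_auc_score(y_true, y_prob))
print("AUPRC    :", average_precision_score(y_true, y_prob))
```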
Table 15. Selection of decision threshold.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LightGBM + Threshold 1 | 0.999996 | 0.964526 | 0.972635 | 0.968564 |
| LightGBM + Threshold 2 | 0.999958 | 0.966507 | 0.974277 | 0.970376 |
| LightGBM + Threshold 3 | 0.999959 | 0.969151 | 0.972267 | 0.970706 |
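Table 15 identifies the candidate decision thresholds only by their labels, so the sketch below illustrates the general procedure: sweep a few threshold values over the predicted malware probabilities and recompute the threshold-dependent metrics. The threshold values and data are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])                    # toy labels
y_prob = np.array([0.02, 0.10, 0.48, 0.55, 0.70, 0.95])  # toy probabilities

for t in (0.3, 0.5, 0.7):  # candidate decision thresholds (illustrative)
    y_pred = (y_prob >= t).astype(int)
    print(f"threshold={t}: "
          f"precision={precision_score(y_true, y_pred):.3f} "
          f"recall={recall_score(y_true, y_pred):.3f} "
          f"F1={f1_score(y_true, y_pred):.3f}")
```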
Table 16. Variations of feature sets for feature selection techniques.

| Variation | Feature Sets |
|---|---|
| Variation1 | App information, App permissions, App service |
| Variation2 | App information, App permissions, App service, User activity |
| Variation3 | App information, App permissions, App service, User information |
| Variation4 | App information, App permissions, App service, App package name statistics |
| Variation5 | App information, App permissions, App service, User activity, App package name statistics |
| Our Proposed | App information, App permissions, App service, User activity, User information, App package name statistics |
Table 17. Statistical measures of Accuracy, Recall, Precision, and F1-Score: variation feature sets.

Accuracy

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.999551 | 0.999976 | 0.999813 | 0.999817 | 0.999763 | 0.999875 |
| Variation2 | 0.999646 | 0.999978 | 0.999901 | 0.999914 | 0.999875 | 0.999938 |
| Variation3 | 0.999526 | 1.000000 | 0.999848 | 0.999869 | 0.999795 | 0.999905 |
| Variation4 | 0.999615 | 1.000000 | 0.999873 | 0.999883 | 0.999841 | 0.999927 |
| Variation5 | 0.999773 | 1.000000 | 0.999934 | 0.999948 | 0.999902 | 0.999975 |
| Our Proposed | 0.999819 | 1.000000 | 0.999961 | 0.999972 | 0.999930 | 1.000000 |

Recall

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.500000 | 1.000000 | 0.829375 | 0.847826 | 0.782269 | 0.888889 |
| Variation2 | 0.500000 | 1.000000 | 0.879349 | 0.900980 | 0.833333 | 0.953611 |
| Variation3 | 0.500000 | 1.000000 | 0.838819 | 0.848913 | 0.780382 | 0.905844 |
| Variation4 | 0.500000 | 1.000000 | 0.846844 | 0.857143 | 0.797368 | 0.911275 |
| Variation5 | 0.714286 | 1.000000 | 0.966149 | 0.973684 | 0.945571 | 1.000000 |
| Our Proposed | 0.714286 | 1.000000 | 0.976437 | 1.000000 | 0.970129 | 1.000000 |

Precision

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.333333 | 0.975000 | 0.819563 | 0.851648 | 0.750000 | 0.921559 |
| Variation2 | 0.500000 | 1.000000 | 0.916453 | 0.950000 | 0.897222 | 0.968750 |
| Variation3 | 0.400000 | 1.000000 | 0.882924 | 0.907670 | 0.857143 | 0.963294 |
| Variation4 | 0.500000 | 1.000000 | 0.910078 | 0.948684 | 0.888889 | 0.972410 |
| Variation5 | 0.625000 | 1.000000 | 0.932044 | 0.942017 | 0.901829 | 0.974519 |
| Our Proposed | 0.714286 | 1.000000 | 0.961714 | 0.974359 | 0.944444 | 1.000000 |

F1-Score

| Variation | Min | Max | Mean | Median | 25% | 75% |
|---|---|---|---|---|---|---|
| Variation1 | 0.400000 | 0.986667 | 0.821788 | 0.852459 | 0.747685 | 0.895522 |
| Variation2 | 0.500000 | 0.987013 | 0.895133 | 0.923077 | 0.857143 | 0.953416 |
| Variation3 | 0.444444 | 1.000000 | 0.857345 | 0.878356 | 0.819141 | 0.918254 |
| Variation4 | 0.500000 | 1.000000 | 0.875704 | 0.895522 | 0.841667 | 0.938776 |
| Variation5 | 0.714286 | 1.000000 | 0.947998 | 0.957746 | 0.932323 | 0.982456 |
| Our Proposed | 0.714286 | 1.000000 | 0.968545 | 0.978251 | 0.956522 | 1.000000 |
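The Table 17 summary statistics are simple aggregations over the 92 per-dataset scores; a pandas sketch of the computation on placeholder F1-scores is given below.

```python
import pandas as pd

# Placeholder per-dataset F1-scores; the study aggregates 92 daily values.
f1_scores = pd.Series([0.95, 0.97, 1.00, 0.98, 0.96, 0.99, 1.00, 0.97])

summary = f1_scores.agg(["min", "max", "mean", "median"])
summary["25%"] = f1_scores.quantile(0.25)
summary["75%"] = f1_scores.quantile(0.75)
print(summary)
```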
Table 18. Comparative analysis with existing work.

| Category | Proposed Work | Existing Work | Comparison |
|---|---|---|---|
| Datasets | Real-world dataset from Bank A (South Korea) with 183,938,730 benign transactions and 11,986 malware transactions; highly imbalanced | Most existing works (e.g., DroidDetector [54]) used smaller, lab-controlled datasets, often from third-party app stores or Google Play | The real-world, large-scale dataset offers better generalizability compared with the smaller, simulated datasets used in existing work. |
| Detection Methodology | Static analysis (feature extraction) for both known and unknown malware | Existing methods such as DroidDetector [54] and Zhou et al. [59] rely primarily on dynamic analysis, with high computational overheads (e.g., TaintDroid [12]) | Static analysis in the proposed work is more scalable and resource-efficient for large operational environments. |
| Machine Learning Models | Tested models: Logistic Regression, Random Forest, LightGBM, XGBoost, and CatBoost; LightGBM showed the best performance | Random Forest and SVM are commonly used but struggle with large, imbalanced datasets; some works use deep learning, which is computationally expensive | LightGBM is faster, more scalable, and better at handling imbalanced data than traditional models such as SVM and Random Forest. |
| Performance Metrics | Achieved 99.99% accuracy, an F1-score of 96.86% at the default threshold (97.04% after threshold selection), AUROC of 0.999559, and AUPRC of 0.987537 | DroidDetector [54] achieved 96.76% accuracy and a 92.85% F1-score; other models struggled with imbalanced datasets | The proposed model shows significant improvements in F1-score and AUPRC, effectively handling imbalanced data with fewer false positives. |
| Real-World Application | Designed for real-time deployment in financial institutions; uses a lightweight model (LightGBM) to handle large-scale data | Existing models such as TaintDroid [12] and DroidDetector [54] are not scalable or practical for real-time, large-scale environments | The proposed framework is more scalable and applicable to real-world financial institutions than existing methods. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
