RepackDroid: An Efficient Detection Model for Repackaged Android Applications
Abstract
1. Introduction
- Feature Explosion: Some state-of-the-art approaches extract hundreds of features (e.g., APIs, permissions, control flow graphs), which leads to high-dimensional data, longer training times, and limited portability [7].
- Obfuscation and Code Transformation: Attackers frequently employ obfuscation, reordering, and library injection to disguise malicious payloads, which can confuse both similarity-based and machine learning-based detectors.
- Dataset Limitations: The lack of large, high-quality benchmark datasets has hindered fair evaluation and reproducibility of proposed methods.
- Novel Feature Extraction Framework: We design and implement a tool that automatically processes APKs into a compact set of features tailored for repackaging detection.
- Efficient Learning with Fewer Features: We demonstrate that a reduced feature space (20 features) can outperform prior models trained on more than 500 features, achieving improved recall and F1 scores while reducing computational overhead.
- Comprehensive Empirical Evaluation: Using 8441 Android applications from the RePack dataset [8], we show that Support Vector Machines achieve up to 98.8% recall and 85.9% F1 score, outperforming state-of-the-art baselines. We also evaluate the predictive power of the string offset order anomaly, validating its role as a cost-effective indicator of repackaging.
2. Related Work
3. The RepackDroid Approach
3.1. Overview of Approach
- Dataset Preparation: We curate and preprocess applications from the RePack dataset [8], ensuring that each sample is properly labeled as original or repackaged.
- APK Decompilation and Parsing: Applications are decompiled using established tools to expose manifest files, Dalvik executable (DEX) bytecode, and intermediate smali code representations.
- Feature Extraction: We derive 20 discriminative features from four categories—inter-component communication (ICC), permissions, sensitive API usage, and structural anomalies.
- Classification: Supervised learning models are trained and tested to predict whether an application is repackaged. The feature set includes AndroidSOO’s [22] string offset order anomaly as a low-cost, structural indicator of repackaging.
3.2. Dataset Construction
- AndroidManifest.xml: Package metadata, permissions, and component definitions.
- classes.dex: Dalvik bytecode compiled from Java sources.
- .smali files: Human-readable intermediate code representation from DEX disassembly.
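As a rough illustration of where the first two artifacts come from, the sketch below lists them directly from an APK, relying only on the fact that an APK is a ZIP archive. The function name `list_apk_artifacts` is hypothetical and is not part of the RepackDroid tooling. The .smali files are not stored in the APK itself; they are produced by disassembling the DEX files, so they do not appear here.

```python
import io
import zipfile


def list_apk_artifacts(apk_bytes: bytes) -> dict:
    """Group APK entries into the manifest and the top-level DEX files.

    Hypothetical helper for illustration only. The manifest inside an APK
    is stored in a binary XML encoding, so it still has to be decoded by a
    decompilation tool before it can be read as text.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
    return {
        "manifest": [n for n in names if n == "AndroidManifest.xml"],
        # Multi-DEX apps ship classes.dex, classes2.dex, ... at the root.
        "dex": sorted(n for n in names if n.endswith(".dex") and "/" not in n),
    }
```

Calling it on the raw bytes of an APK returns a dictionary with a `manifest` list and a `dex` list, which can serve as a quick sanity check before decompilation.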
3.3. Feature Extraction
3.3.1. APK Decompilation and Preservation of DEX Files
Algorithm 1. Feature Extraction Workflow.
3.3.2. Parsing Inter-Component Communication (ICC)
- ICC Name: the method central to initializing the communication (e.g., startActivity(), sendBroadcast()).
- Source Component: the component from which the intent originates.
- Target Component: the intended recipient of the communication, which may not always be identifiable due to limitations of smali code analysis. In cases where the target cannot be resolved, it is assigned a null value.
- Type of Communication: whether the intent is internal (within the same app) or external (directed outside the app), typically inferred through manifest file analysis.
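The four fields above can be pictured as a small record extracted from smali text. The following is a minimal sketch under stated assumptions, not the paper's parser: the `ICC_METHODS` set is abridged, the target is guessed from the most recent `const-class` instruction in the same method (an illustrative heuristic), and a call is labeled internal only when that target appears in a caller-supplied set of components declared in the manifest. All names (`IccTuple`, `extract_icc`) are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Set

# Abridged, illustrative list of ICC-initiating methods.
ICC_METHODS = {"startActivity", "startActivityForResult",
               "sendBroadcast", "startService", "bindService"}

CLASS_RE = re.compile(r"^\.class\s+(?:[\w-]+\s+)*L([^;]+);")
CONST_CLASS_RE = re.compile(r"^\s*const-class\s+\w+,\s*L([^;]+);")
INVOKE_RE = re.compile(r"^\s*invoke-\S+\s+\{[^}]*\},\s*L[^;]+;->(\w+)\(")


class IccTuple(NamedTuple):
    icc_name: str
    source: Optional[str]
    target: Optional[str]   # None when the target cannot be resolved
    comm_type: str          # "internal" or "external"


def extract_icc(smali_text: str, declared: Set[str]) -> List[IccTuple]:
    """Scan one smali file and emit an ICC tuple per matching invoke."""
    source, last_class, found = None, None, []
    for line in smali_text.splitlines():
        m = CLASS_RE.match(line)
        if m:
            source = m.group(1)
        elif line.lstrip().startswith(".method"):
            last_class = None  # do not carry a candidate across methods
        elif CONST_CLASS_RE.match(line):
            last_class = CONST_CLASS_RE.match(line).group(1)
        else:
            m = INVOKE_RE.match(line)
            if m and m.group(1) in ICC_METHODS:
                kind = "internal" if last_class in declared else "external"
                found.append(IccTuple(m.group(1), source, last_class, kind))
                last_class = None
    return found
```

Unresolved targets come out as `None` and are labeled external in this sketch, which mirrors the null-valued rows described above but is otherwise a simplification.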
3.3.3. Metadata and Preprocessing Output
- Single-row metadata—such as the permissions list, AndroidSOO output, and binary classification label (repackaged vs. original).
- Iteratively updated fields—such as counts of Android APIs, Java APIs, and user actions, which were incremented with each discovered ICC tuple.
3.3.4. Consolidation and Feature Engineering
- Computed new numerical features derived from ICC tuples and their relationships.
- Transferred metadata fields from the first row of each CSV into the consolidated dataset.
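One plausible way to derive the numerical ICC features from a list of tuples is sketched below. The feature names follow the feature table in this article and the method groupings follow its ICC method table; the function name `icc_features` and the choice to ignore unresolved (null) targets when finding the most common target are assumptions made for illustration.

```python
from collections import Counter

BROADCAST = {"sendBroadcast", "sendBroadcastAsUser", "sendOrderedBroadcastAsUser",
             "sendStickyBroadcast", "sendStickyBroadcastAsUser",
             "sendStickyOrderedBroadcast", "sendStickyOrderedBroadcastAsUser"}
ACTIVITY = {"startActivities", "startActivity", "startActivityForResult",
            "startActivityFromChild", "startActivityFromFragment",
            "startActivityIfNeeded"}
SERVICE = {"startService", "bindService"}


def icc_features(tuples):
    """Aggregate (icc_name, source, target, comm_type) rows into counts."""
    rows = list(tuples)
    names = Counter(r[0] for r in rows)
    broadcast = sum(c for n, c in names.items() if n in BROADCAST)
    activity = sum(c for n, c in names.items() if n in ACTIVITY)
    service = sum(c for n, c in names.items() if n in SERVICE)
    sources = Counter(r[1] for r in rows)
    # Assumption: unresolved targets (None) are not counted as a component.
    targets = Counter(r[2] for r in rows if r[2] is not None)
    return {
        "BroadcastReceiverOccurenceFrequency": broadcast,
        "ActivityOccurenceFrequency": activity,
        "ServicesOccurenceFrequency": service,
        "TotalComponentFrequency": broadcast + activity + service,
        "MostCommonTargetComponent": max(targets.values(), default=0),
        "MostCommonSourceComponent": max(sources.values(), default=0),
        "InternalOccurence": sum(1 for r in rows if r[3] == "internal"),
        "ExternalOccurence": sum(1 for r in rows if r[3] == "external"),
    }
```

Applied to the five sample rows shown in the ICC tuple table, this yields three broadcast calls, one activity call, one service call, and five external communications.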
3.4. Feature Set Design
3.4.1. ICC Features
3.4.2. Permissions Features
3.4.3. Sensitive API Features
- Total API count (sum of Android and Java APIs).
- API-per-component ratio, estimating the average intensity of API usage.
- User action counts, measuring the frequency of user-triggered interactions.
- User action–per-component ratio, normalizing these interactions by application size.
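The four quantities listed above reduce to a sum and two ratios. The sketch below shows the arithmetic; the function name `api_features`, the float return type for the ratios, and the fallback to 0.0 when an app has no ICC components are assumptions rather than details taken from the paper.

```python
def api_features(android_api_count: int, java_api_count: int,
                 user_action_count: int, total_component_frequency: int) -> dict:
    """Derive the API and user-action features from raw counts."""
    total_api = android_api_count + java_api_count

    def per_component(value: int) -> float:
        # Assumption: avoid division by zero for apps without ICC components.
        if total_component_frequency == 0:
            return 0.0
        return value / total_component_frequency

    return {
        "totalApiCount": total_api,
        "apiPerComponent": per_component(total_api),
        "userActionCount": user_action_count,
        "userActionPerComponent": per_component(user_action_count),
    }
```

For example, an app with 30 Android API calls, 10 Java API calls, 8 user actions, and 4 components gets a total API count of 40, an API-per-component ratio of 10.0, and a user-action-per-component ratio of 2.0.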
3.4.4. Structural Anomaly Feature: String Offset Order
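To make the idea of a string offset order check concrete, the sketch below reads the `string_ids` table of a DEX file and reports whether the string data offsets it contains are out of increasing order. This is an illustrative reading of the anomaly, not AndroidSOO's implementation: it assumes a little-endian DEX file, uses the standard header positions of `string_ids_size` (0x38) and `string_ids_off` (0x3C), and the function name is hypothetical.

```python
import struct


def string_offsets_out_of_order(dex: bytes) -> bool:
    """Return True if string_data_off values are not strictly increasing.

    Each entry of the string_ids table is a single 32-bit offset pointing
    at the string's data. The check walks those offsets in index order.
    """
    if dex[:4] != b"dex\n":
        raise ValueError("not a DEX file")
    # string_ids_size and string_ids_off are adjacent 32-bit header fields.
    count, table_off = struct.unpack_from("<II", dex, 0x38)
    offsets = struct.unpack_from(f"<{count}I", dex, table_off)
    return any(later <= earlier for earlier, later in zip(offsets, offsets[1:]))
```

The result maps directly onto the boolean `StringOffset` feature: `True` marks an app whose string offsets are out of order.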
4. Dataset Characterization and Preprocessing
4.1. Correlation Analysis
4.2. Distributional Analysis
4.3. Outlier Detection and Filtering
5. Evaluation Results
5.1. Dataset Summary and Preprocessing
5.2. Evaluation Metrics
5.3. Model Performance
- Neural Network achieved the best recall (100%), successfully identifying all repackaged samples in the test set.
- SVM achieved the highest F1-score (85.9%), offering the best balance between precision and recall.
- Logistic Regression achieved the highest PR-AUC (87.8%), demonstrating stable performance across thresholds.
6. Discussion
6.1. Comparative Analysis with State-of-the-Art
6.2. Role of String Offset Order in Detection
6.3. Summary of Results and Main Findings
6.3.1. Competitive Performance with Fewer Features
6.3.2. High Recall Prioritized over Precision
6.3.3. Algorithm-Specific Strengths
- Decision Tree achieved the highest precision (77.6%).
- Neural Network reached perfect recall (100%) but at the expense of lower precision.
- Logistic Regression yielded the strongest PR-AUC (87.8%), reflecting stability across thresholds.
- Support Vector Machine provided the best overall balance with the highest F1-score (85.9%).

This diversity suggests that ensemble methods, which combine the strengths of different algorithms, could further enhance detection performance in future work.
6.3.4. Validation of Symptom Discovery
6.3.5. Broader Implications
7. Threats to Validity
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Living in a Multi-Device World with Android. Available online: https://blog.google/products/android/io22-multideviceworld/ (accessed on 1 October 2025).
- Zhou, Y.; Jiang, X. Dissecting Android Malware: Characterization and Evolution. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 95–109.
- Trend Micro. A Look at Repackaged Apps and their Effect on the Mobile Threat Landscape. Trend Micro Blog. Available online: http://blog.trendmicro.com/trendlabs-security-intelligence/a-look-into-repackaged-apps-and-its-rolein-the-mobile-threat-landscape/ (accessed on 17 December 2021).
- Glanz, L.; Amann, S.; Eichberg, M.; Reif, M.; Hermann, B.; Lerch, J.; Mezini, M. CodeMatch: Obfuscation won’t conceal your repackaged app. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; pp. 638–648.
- Zhou, W.; Zhou, Y.; Jiang, X.; Ning, P. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the Second ACM Conference on Data and Application Security and Privacy, San Antonio, TX, USA, 7–9 February 2012; pp. 317–326.
- Kim, B.; Lim, K.; Cho, S.-J.; Park, M. RomaDroid: A Robust and Efficient Technique for Detecting Android App Clones. IEEE Access 2019, 7, 72182–72196.
- Li, L.; Bissyandé, T.F.; Klein, J. Rebooting Research on Detecting Repackaged Android Apps: Literature Review and Benchmark. IEEE Trans. Softw. Eng. 2019, 47, 676–693.
- Li, L.; Bissyandé, T.; Klein, J. RePack. Github. Available online: https://github.com/serval-snt-uni-lu/RePack (accessed on 17 December 2024).
- Wolfe, B.; Elish, K.; Yao, D. Comprehensive Behavior Profiling for Proactive Android Malware Detection. In Proceedings of the Information Security Conference (ISC), Hong Kong, China, 12–14 October 2014.
- Zhou, Y.; Wang, Z.; Zhou, W.; Jiang, X. Hey, You, Get off of My Market: Detecting Malicious Apps in Official and Alternative Android Markets. In Proceedings of the 19th Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 5–8 February 2012.
- Dave, D.D.; Rathod, D. Systematic review on various techniques of android malware detection. In Communications in Computer and Information Science; Springer International Publishing: Cham, Switzerland, 2022; pp. 82–99.
- Wolfe, B.; Elish, K.; Yao, D. High Precision Screening for Android Malware with Dimensionality Reduction. In Proceedings of the International Conference on Machine Learning and Applications, Detroit, MI, USA, 3–6 December 2014.
- Alzaylaee, M.K.; Yerima, S.Y.; Sezer, S. DL-Droid: Deep learning based Android malware detection using real devices. Comput. Secur. 2020, 89, 101663.
- Elish, K.; Yao, D.; Ryder, B. User-Centric Dependence Analysis for Identifying Malicious Mobile Apps. In Proceedings of the IEEE Mobile Security Technologies (MoST) Workshop, in Conjunction with the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 24 May 2012.
- Elish, K.O.; Shu, X.; Yao, D.D.; Ryder, B.G.; Jiang, X. Profiling user-trigger dependence for Android malware detection. Comput. Secur. 2014, 49, 255–273.
- Martín, I.; Hernández, J.A. CloneSpot: Fast detection of Android repackages. Future Gener. Comput. Syst. 2019, 94, 740–748.
- Zhou, W.; Zhang, X.; Jiang, X. AppInk: Watermarking android apps for repackaging deterrence. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, Hangzhou, China, 8–10 May 2013.
- Nguyen, T.; Mcdonald, J.T.; Glisson, W.; Andel, T. Detecting Repackaged Android Applications Using Perceptual Hashing. In Proceedings of the 53rd Hawaii International Conference on System Sciences, Maui, HI, USA, 7–10 January 2020.
- Lin, Y.-D.; Lai, Y.-C.; Chen, C.-H.; Tsai, H.-C. Identifying Android Malicious Repackaged Applications by Thread-Grained System Call Sequences. Comput. Secur. 2013, 39, 340–350.
- Tian, K.; Yao, D.; Ryder, B.G.; Tan, G. Analysis of Code Heterogeneity for High-Precision Classification of Repackaged Malware. In Proceedings of the 2016 IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 22–26 May 2016; pp. 262–271.
- Wu, D.J.; Mao, C.H.; Wei, T.E.; Lee, H.M.; Wu, K.P. Droidmat: Android Malware Detection Through Manifest and API Calls Tracing. In Proceedings of the 2012 Seventh Asia Joint Conference on Information Security, Tokyo, Japan, 9–10 August 2012.
- Gonzalez, H.; Kadir, A.; Stakhanova, N.; Alzahrani, A.; Ghorbani, A. Exploring reverse engineering symptoms in Android apps. In Proceedings of the Eighth European Workshop on System Security, Bordeaux, France, 21 April 2015.
- Androzoo. Available online: https://androzoo.uni.lu/ (accessed on 10 September 2023).
- Apktool: A Tool for Reverse Engineering Android Apk Files. Available online: https://apktool.org/ (accessed on 15 September 2023).
- Baksmali. Available online: https://github.com/JesusFreke/smali (accessed on 15 September 2023).
- Elish, K.; Cai, H.; Barton, D.; Yao, D.; Ryder, B. Identifying Mobile Inter-App Communication Risks. IEEE Trans. Mob. Comput. 2020, 19, 90–102.
- Tian, K.; Yao, D.; Ryder, B.G.; Tan, G.; Peng, G. Detection of Repackaged Android Malware with Code-Heterogeneity Features. IEEE Trans. Dependable Secur. Comput. 2017, 17, 64–77.
- Permissions on Android: Android Developers. Available online: https://developer.android.com/guide/topics/permissions/overview (accessed on 18 March 2025).
- Sun, B.; Li, Q.; Guo, Y.; Wen, Q.; Lin, X.; Liu, W. Malware family classification method based on static feature extraction. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 507–513.
- SMOTE—Azure Machine Learning. Microsoft Docs. Available online: https://docs.microsoft.com/en-us/azure/machine-learning/component-reference/smote (accessed on 21 July 2025).
- Branco, P.; Torgo, L.; Ribeiro, R. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 2017, 49, 31:1–31:50.
- Li, L.; Bissyandé, T.F.; Klein, J. SimiDroid: Identifying and explaining similarities in Android apps. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, NSW, Australia, 1–4 August 2017.
- Arp, D.; Spreitzenbarth, M.; Hübner, M.; Gascon, H.; Rieck, K. Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket. In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 23–26 February 2014.
- Wei, F.; Li, Y.; Roy, S.; Ou, X.; Zhou, W. Deep Ground Truth Analysis of Current Android Malware. In Proceedings of the 14th International Conference on the Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), Bonn, Germany, 6–7 July 2017.




| ICC Name | Source Component | Target Component | Type of Communication |
|---|---|---|---|
| startActivity | com/smartandroidapps/equalizer/AApplication$3 | android.intent.action.VIEW | external |
| sendBroadcast | com/smartandroidapps/equalizer/AApplication | com.smartandroidapps.equalizer.unlock | external |
| startService | com/smartandroidapps/equalizer/ApplyShortcutProfile | com/smartandroidapps/equalizer/UpdateService | external |
| sendBroadcast | com/smartandroidapps/equalizer/ApplyShortcutProfile | null | external |
| sendBroadcast | com/smartandroidapps/equalizer/AudioFx$15 | null | external |
| Feature | Explanation |
|---|---|
| BroadcastReceiverOccurenceFrequency | int: Frequency of Broadcast Receiver type ICC |
| ActivityOccurenceFrequency | int: Frequency of Activity type ICC |
| ServicesOccurenceFrequency | int: Frequency of Services type ICC |
| TotalComponentFrequency | int: BroadcastReceiverOccurenceFrequency + ActivityOccurenceFrequency + ServicesOccurenceFrequency: Frequency of all components |
| MostCommonTargetComponent | int: Frequency of the most common target component |
| MostCommonSourceComponent | int: Frequency of the most common source component |
| InternalOccurence | int: Frequency of internal ICC |
| ExternalOccurence | int: Frequency of external ICC |
| Permissions | int: Total number of permissions |
| NormalPerms | int: Number of normal permissions |
| DangerousPerms | int: Number of dangerous permissions |
| SignaturePerms | int: Number of signature permissions |
| RiskRatePerPerms | int: DangerousPerms/Permissions: ‘Riskiness’ of permission spread |
| androidApiCount | int: Number of Android-specific APIs |
| javaApiCount | int: Number of Java-specific APIs |
| totalApiCount | int: androidApiCount + javaApiCount: Total API count |
| apiPerComponent | int: totalApiCount/TotalComponentFrequency: Average number of API uses per component |
| userActionCount | int: Frequency of user actions in the code |
| userActionPerComponent | int: userActionCount/TotalComponentFrequency: Average number of user actions per component |
| StringOffset | bool: Is the string offset out of order? |
| Broadcast Receivers | Activity | Services |
|---|---|---|
| sendBroadcast | startActivities | startService |
| sendBroadcastAsUser | startActivity | bindService |
| sendOrderedBroadcastAsUser | startActivityForResult | |
| sendStickyBroadcast | startActivityFromChild | |
| sendStickyBroadcastAsUser | startActivityFromFragment | |
| sendStickyOrderedBroadcast | startActivityIfNeeded | |
| sendStickyOrderedBroadcastAsUser | | |
| Algorithm | Precision | Recall | F1 Score | PR-AUC |
|---|---|---|---|---|
| Support Vector Machine | 0.720 | 0.988 | 0.859 | 0.835 |
| K-Nearest Neighbors | 0.734 | 0.939 | 0.824 | 0.869 |
| Gaussian Naive Bayes | 0.721 | 0.948 | 0.819 | 0.773 |
| Logistic Regression | 0.719 | 0.991 | 0.833 | 0.782 |
| Random Forest | 0.717 | 0.994 | 0.833 | 0.868 |
| Decision Tree | 0.776 | 0.805 | 0.790 | 0.701 |
| XGBoost | 0.768 | 0.827 | 0.796 | 0.846 |
| Neural Network | 0.717 | 1.000 | 0.835 | 0.731 |
| Technique | Approach Type | Features Used | Test Size (Apps) | Reported Performance (%) |
|---|---|---|---|---|
| CodeMatch [4] | Similarity Comparison | Opcode-based, library filtering | 1000 | Precision: 85, Recall: 99, F1 Score: 90 |
| RomaDroid [6] | Similarity Comparison | Manifest-based | 620 | Precision: 90.9, Recall: 98.3, F1 Score: 94.5 |
| SimiDroid [32] | Similarity Comparison | Hybrid (UI + code) | 620 | Precision: 99.4, Recall: 60.6, F1 Score: 75.3 |
| Li et al. [7] | Supervised Learning | 521 features (permissions, APIs, capabilities) | 18,073 | Precision: 83.2, Recall: 79.1, F-Measure: 81.1 |
| RepackDroid (Our Approach) | Supervised + Symptom Discovery | 20 features (ICC, permissions, APIs, string offset order) | 8441 | Precision: 72.0, Recall: 98.8, F1 Score: 85.9 |
| Measure | Value | Derivations |
|---|---|---|
| Recall | 0.6575 | TPR = TP/(TP + FN) |
| Specificity | 0.7267 | SPC = TN/(FP + TN) |
| Precision | 0.8796 | PPV = TP/(TP + FP) |
| Negative Predictive Value | 0.4114 | NPV = TN/(TN + FN) |
| False Discovery Rate | 0.1204 | FDR = FP/(FP + TP) |
| False Negative Rate | 0.3425 | FNR = FN/(FN + TP) |
| Accuracy | 0.6746 | ACC = (TP + TN)/(P + N) |
| F1 Score | 0.7525 | F1 = 2TP/(2TP + FP + FN) |
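The formulas in the table above can be evaluated directly from the four confusion-matrix counts. The sketch below is a plain transcription of those formulas; the function name `confusion_metrics` is hypothetical, and the counts in the usage note are made up for illustration because the underlying counts are not listed here.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the eight measures from true/false positive/negative counts.

    No zero-division guards are included; each denominator is assumed
    to be non-zero.
    """
    return {
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fdr": fp / (fp + tp),
        "fnr": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```

With illustrative counts of 8 true positives, 1 false positive, 9 true negatives, and 2 false negatives, recall is 0.8, specificity is 0.9, and accuracy is 0.85. As a consistency check on the reported values, the equivalent form F1 = 2PR/(P + R) with P = 0.8796 and R = 0.6575 gives 0.7525, matching the table.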
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Leadon, T.; Elish, K. RepackDroid: An Efficient Detection Model for Repackaged Android Applications. Information 2025, 16, 1075. https://doi.org/10.3390/info16121075

