Distributed Detection of Malicious Android Apps While Preserving Privacy Using Federated Learning
Abstract
1. Introduction
- Privacy-preserving FL can feasibly detect malicious Android apps in a distributed fashion in real cellular networks.
- Fewer global training rounds for the FedAvg algorithm [15] typically translate into higher efficiency.
- However, we observed no clear correlation between training efficiency and client availability or local training intensity. Rather, the biggest factor affecting efficiency is the communication overhead of updating model parameters.
2. Related Work
2.1. Signature-Based Detection
2.2. Behavior-Based Detection
3. A Privacy-Preserving Cross-Silo Federated Learning Framework
3.1. System Model
3.2. Parameter Update
3.3. Federation Algorithm
4. Experimental Setup
4.1. Experiment Environment and Dataset
4.1.1. Android App Imaging
- AndroidManifest.xml: The first file read when the application runs. It stores essential application information, such as components, hardware capabilities, and user permissions.
- classes.dex: Dalvik opcodes compiled to run on the Dalvik virtual machine.
- resources.arsc: XML files compiled into binary form that are required to run the APK.
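
One way to picture the imaging step is the sketch below, which reads the three files listed above from an APK (a ZIP archive) and lays each file's bytes out as a square grid of 0 to 255 values. The source does not say how the image is actually built, so the function names `bytes_to_plane` and `apk_to_planes`, the default 256-pixel side length, the truncation and zero padding, and the one-plane-per-file layout are all assumptions made for illustration.

```python
import io
import zipfile

# Files named in the text; everything else about the layout is an assumption.
APK_ENTRIES = ("AndroidManifest.xml", "classes.dex", "resources.arsc")


def bytes_to_plane(data: bytes, side: int = 256) -> list[list[int]]:
    """Lay raw bytes out row by row as a side x side grid of 0..255 values.

    Data longer than side*side is truncated; shorter data is zero-padded.
    """
    size = side * side
    padded = data[:size].ljust(size, b"\x00")
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


def apk_to_planes(apk_bytes: bytes, side: int = 256) -> list[list[list[int]]]:
    """Return one plane per entry in APK_ENTRIES (a missing entry gives zeros)."""
    planes = []
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = set(apk.namelist())
        for entry in APK_ENTRIES:
            data = apk.read(entry) if entry in names else b""
            planes.append(bytes_to_plane(data, side))
    return planes


def make_toy_apk() -> bytes:
    """Build a tiny in-memory ZIP standing in for an APK (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("AndroidManifest.xml", b"\x03\x00\x08\x00")
        z.writestr("classes.dex", b"dex\n035\x00")
    return buf.getvalue()


planes = apk_to_planes(make_toy_apk(), side=4)
print(len(planes), len(planes[0]), len(planes[0][0]))  # planes, rows, columns
```

Stacking the three planes would give a three-channel input, which is consistent with a network whose first layer takes three channels, but that correspondence is a guess rather than something the text states.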
4.1.2. Separation of Training/Testing Data
4.1.3. Data Distribution over Edge Entities
- IID: The training samples are shuffled, and then samples are allocated to each entity.
- Non-IID: First, we sort the data by label (Android app type), divide them into 200 shards of 70 samples each, and allocate the shards across the K edge entities. As a result, most edge entities have training samples for only two classes of applications. This is a so-called pathological non-IID partition of the data. Note that the IID and non-IID partitions are balanced.
- Non-IID and imbalanced: This is similar to non-IID. First, we sort the data by label and divide them, for example, into 1400 shards of 10 samples each. We allocate at least one shard, up to a maximum number of shards, to each of the K edge entities. As with non-IID, this constitutes a pathological non-IID partition of the data, and many allocation methods can produce sufficient imbalance.
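
The balanced non-IID partition described above can be sketched as follows. This is a minimal illustration, not the author's code: the function name `shard_partition`, the random dealing of shards, the seed, and the toy label list are hypothetical, and giving each entity exactly two shards is an assumption derived from 200 shards and K = 100.

```python
import random


def shard_partition(labels, num_shards, shard_size, num_entities, seed=0):
    """Sort sample indices by label, cut them into shards, deal shards to entities.

    Returns a list with one list of sample indices per edge entity.
    Assumes num_shards is a multiple of num_entities (equal shards per entity).
    """
    assert num_shards * shard_size <= len(labels)
    assert num_shards % num_entities == 0
    # Sorting by label makes each shard (almost) single-class.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng = random.Random(seed)
    rng.shuffle(shards)
    per_entity = num_shards // num_entities
    return [
        [i for shard in shards[e * per_entity:(e + 1) * per_entity] for i in shard]
        for e in range(num_entities)
    ]


# Toy labels: 5 app classes, 14,000 samples (200 shards x 70), K = 100 entities.
labels = [i % 5 for i in range(14_000)]
parts = shard_partition(labels, num_shards=200, shard_size=70, num_entities=100)
classes_per_entity = [len({labels[i] for i in p}) for p in parts]
print(max(classes_per_entity))  # no entity sees more than two classes
```

An imbalanced variant would deal a varying number of smaller shards to each entity instead of a fixed two; the upper bound on that number is not recoverable from the text, so it is left out here.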
4.2. FL Training Setup
4.2.1. Deep Learning Networks
- Convolution L1: 3 × 6 convolution (3 input channels, 6 output channels) with a 5 × 5 kernel, a stride of 1, and ReLU activation.
- Max pooling L1: follows convolution layer 1, with a 2 × 2 kernel and a stride of 2.
- Convolution L2: 6 × 16 convolution (6 input channels, 16 output channels) with a 5 × 5 kernel, a stride of 1, and ReLU activation.
- Max pooling L2: follows convolution layer 2, with the same settings as max pooling layer 1.
- Fully connected L1: 59,536 input features are fully connected to 120 output features.
- Fully connected L2: 120 input features are fully connected to 84 output features.
- Fully connected L3: 84 input features are fully connected to 5 output features.
- Softmax: performing classification.
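
The 59,536 input features of the first fully connected layer can be checked with a little arithmetic. The sketch below assumes a 256 × 256 input image and no padding, neither of which is stated in the text, and reads "6 × 16" as 6 input and 16 output channels; `conv_out` is a hypothetical helper name.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output side length of a convolution or pooling layer with no padding."""
    return (size - kernel) // stride + 1


side = 256                         # assumed input side length (not stated in the text)
side = conv_out(side, 5, 1)        # Convolution L1 -> 252
side = conv_out(side, 2, 2)        # Max pooling L1 -> 126
side = conv_out(side, 5, 1)        # Convolution L2 -> 122
side = conv_out(side, 2, 2)        # Max pooling L2 -> 61
flat_features = 16 * side * side   # 16 output channels from Convolution L2
print(flat_features)               # 59536
```

Under these assumptions the flattened feature count is 16 × 61 × 61 = 59,536, which matches the figure given for Fully connected L1.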
4.2.2. FL Hyperparameters
5. Experimental Results
5.1. Feasibility of Applying FL
5.2. Efficiency Gains through Distributed Computing
5.3. Impact of Hyperparameters
6. Discussion
6.1. Is 91% of Test Accuracy Sufficient?
6.2. Countermeasures against Adversarial Attack
7. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Gartner Research. Market Share: PCs, Ultramobiles and Mobile Phones, All Countries, 4Q21 Update. 2022. Available online: https://www.gartner.com/en/documents/4011646 (accessed on 12 February 2023).
- Kaspersky. IT Threat Evolution in Q2 2022. Mobile Statistics. 2022. Available online: https://securelist.com/it-threat-evolution-in-q2-2022-mobile-statistics/107123/ (accessed on 12 February 2023).
- Vinod, P.; Zemmari, A.; Conti, M. A machine learning based approach to detect malicious android apps using discriminant system calls. Future Gener. Comput. Syst. 2019, 94, 333–350.
- Lee, S.; Kim, S.; Lee, S.; Choi, J.; Yoon, H.; Lee, D.; Lee, J.R. LARGen: Automatic Signature Generation for Malwares Using Latent Dirichlet Allocation. IEEE Trans. Dependable Secur. Comput. 2018, 15, 771–783.
- Drainakis, G.; Katsaros, K.V.; Pantazopoulos, P.; Sourlas, V.; Amditis, A. Federated vs. Centralized Machine Learning under Privacy-elastic Users: A Comparative Analysis. In Proceedings of the 2020 IEEE 19th International Symposium on Network Computing and Applications (NCA), Cambridge, MA, USA, 24–27 November 2020; pp. 1–8.
- Preuveneers, D.; Rimmer, V.; Tsingenopoulos, I.; Spooren, J.; Joosen, W.; Ilie-Zudor, E. Chained Anomaly Detection Models for Federated Learning: An Intrusion Detection Case Study. Appl. Sci. 2018, 8, 2663.
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661.
- Kang, M.; Kim, H.; Lee, S.; Han, S. Resilience against Adversarial Examples: Data-Augmentation Exploiting Generative Adversarial Networks. KSII Trans. Internet Inf. Syst. 2021, 15, 4105–4121.
- Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends® Mach. Learn. 2021, 14, 1–210.
- Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. Adv. Neural Inf. Process. Syst. 2017, 30, 4427–4437.
- Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-iid data. arXiv 2018, arXiv:1806.00582.
- Criado, M.F.; Casado, F.E.; Iglesias, R.; Regueiro, C.V.; Barro, S. Non-IID data and Continual Learning processes in Federated Learning: A long road ahead. Inf. Fusion 2022, 88, 263–280.
- Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. arXiv 2019, arXiv:1907.02189.
- Wang, H.; Sievert, S.; Liu, S.; Charles, Z.; Papailiopoulos, D.; Wright, S. Atomo: Communication-efficient learning via atomic sparsification. Adv. Neural Inf. Process. Syst. 2018, 31, 9872–9883.
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
- Wang, Z. Deep learning-based intrusion detection with adversaries. IEEE Access 2018, 6, 38367–38384.
- Huang, C.H.; Lee, T.H.; Chang, L.H.; Lin, J.R.; Horng, G. Adversarial attacks on SDN-based deep learning IDS system. In Proceedings of the International Conference on Mobile and Wireless Technology, Hong Kong, China, 25–27 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 181–191.
- Schultz, M.G.; Eskin, E.; Zadok, F.; Stolfo, S.J. Data mining methods for detection of new malicious executables. In Proceedings of the 2001 IEEE Symposium on Security and Privacy. S&P 2001, Oakland, CA, USA, 14–16 May 2001; IEEE: Piscataway, NJ, USA, 2000; pp. 38–49.
- Kong, D.; Yan, G. Discriminant malware distance learning on structural information for automated malware classification. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 1357–1365.
- Li, Q.; Li, X. Android malware detection based on static analysis of characteristic tree. In Proceedings of the 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Xi’an, China, 17–19 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 84–91.
- Santos, I.; Brezo, F.; Ugarte-Pedrero, X.; Bringas, P.G. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf. Sci. 2013, 231, 64–82.
- Ni, S.; Qian, Q.; Zhang, R. Malware identification using visualization images and deep learning. Comput. Secur. 2018, 77, 871–885.
- Nataraj, L.; Karthikeyan, S.; Jacob, G.; Manjunath, B.S. Malware images: Visualization and automatic classification. In Proceedings of the 8th International Symposium on Visualization for Cyber Security, Pittsburgh, PA, USA, 20 July 2011; pp. 1–7.
- Han, K.S.; Lim, J.H.; Kang, B.; Im, E.G. Malware analysis using visualized images and entropy graphs. Int. J. Inf. Secur. 2015, 14, 1–14.
- Bayer, U.; Comparetti, P.M.; Hlauschek, C.; Kruegel, C.; Kirda, E. Scalable, behavior-based malware clustering. In Proceedings of the NDSS, San Diego, CA, USA, 11–16 February 2009; Volume 9, pp. 8–11.
- Anderson, B.; Quist, D.; Neil, J.; Storlie, C.; Lane, T. Graph-based malware detection using dynamic analysis. J. Comput. Virol. 2011, 7, 247–258.
- Fujino, A.; Murakami, J.; Mori, T. Discovering similar malware samples using API call topics. In Proceedings of the 2015 12th Annual IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, USA, 9–12 January 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 140–147.
- Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated learning with personalization layers. arXiv 2019, arXiv:1912.00818.
- Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
- Mothukuri, V.; Parizi, R.M.; Pouriyeh, S.; Huang, Y.; Dehghantanha, A.; Srivastava, G. A survey on security and privacy of federated learning. Future Gener. Comput. Syst. 2021, 115, 619–640.
- Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Proceedings of the Advances in Cryptology-EUROCRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Proceedings 25, St. Petersburg, Russia, 28 May–1 June 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 486–503.
- Wang, C.; Wu, X.; Liu, G.; Deng, T.; Peng, K.; Wan, S. Safeguarding cross-silo federated learning with local differential privacy. Digit. Commun. Netw. 2022, 8, 446–454.
- Shokri, R.; Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1310–1321.
- Chen, J.; Pan, X.; Monga, R.; Bengio, S.; Jozefowicz, R. Revisiting distributed synchronous SGD. arXiv 2016, arXiv:1604.00981.
- Mahdavifar, S.; Kadir, A.F.A.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic android malware category classification using semi-supervised deep learning. In Proceedings of the 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 515–522.
- Zizzo, G.; Rawat, A.; Sinn, M.; Buesser, B. FAT: Federated Adversarial Training. arXiv 2020, arXiv:2012.01791.
Notation | Values | Meaning | Remarks |
---|---|---|---|
C | 0.1, 0.2∼1 | Fraction of edge entities selected per round | |
E | 10, 20, 30 | Number of local epochs | |
K | 100 | Number of edge entities | Not controlled |
B | 10 | Local batch size | Not controlled |
N/A | 0.01 | Learning rate | Not controlled |
N/A | 0.5 | SGD momentum | Not controlled |
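
To show how C and K enter a FedAvg round [15], the sketch below selects a fraction C of the K edge entities and then averages their returned weights in proportion to local sample counts. It is a simplified illustration under stated assumptions: local training (E epochs with batch size B) is not shown, weights are flattened into plain lists, and the names `select_clients` and `fedavg`, the rounding of C·K, and the two example updates are hypothetical.

```python
import random


def select_clients(num_clients: int, fraction: float, rng: random.Random) -> list[int]:
    """Pick m = max(round(C * K), 1) distinct clients for this round."""
    m = max(int(round(fraction * num_clients)), 1)
    return rng.sample(range(num_clients), m)


def fedavg(updates: list[tuple[list[float], int]]) -> list[float]:
    """Average client weight vectors, weighted by each client's sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[d] * n for w, n in updates) / total for d in range(dim)]


rng = random.Random(0)
chosen = select_clients(num_clients=100, fraction=0.1, rng=rng)  # K = 100, C = 0.1

# Two hypothetical client updates: (flattened weights, number of local samples).
global_weights = fedavg([([1.0, 2.0], 140), ([3.0, 6.0], 420)])
print(len(chosen), global_weights)  # 10 [2.5, 5.0]
```

The client with three times as many samples pulls the average three times as hard, which is why the result sits closer to its weights.
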
Scheme | C | E | # of Training Rounds | Training Time | Speed-Up |
---|---|---|---|---|---|
Baseline (FedSGD) | - | - | 84 | 9 h 6 m 20.81 s | - |
FL(IID) | 0.1 | 10 | 203 | 5 h 9 m 32.50 s | 1.76× |
FL(IID) | 0.2 | 10 | 131 | 5 h 53 m 16.95 s | 1.54× |
FL(IID) | 0.3 | 10 | 150 | 10 h 6 m 46.96 s | 0.90× |
FL(IID) | 0.1 | 10 | 203 | 5 h 9 m 32.50 s | 1.76× |
FL(IID) | 0.1 | 20 | 133 | 2 h 59 m 20.28 s | 3.04× |
FL(IID) | 0.1 | 30 | 95 | 2 h 8 m 05.92 s | 4.27× |
FL(Non-IID) | 0.1 | 10 | 253 | 5 h 58 m 12.25 s | 1.52× |
FL(Non-IID) | 0.2 | 10 | 189 | 8 h 29 m 41.85 s | 1.07× |
FL(Non-IID) | 0.3 | 10 | 144 | 9 h 42 m 30.69 s | 0.93× |
FL(Non-IID) | 0.1 | 10 | 253 | 5 h 58 m 12.25 s | 1.52× |
FL(Non-IID) | 0.1 | 20 | 195 | 4 h 22 m 56.35 s | 2.08× |
FL(Non-IID) | 0.1 | 30 | 119 | 2 h 40 m 27.62 s | 3.40× |
FL(Non-IID-imbalanced) | 0.1 | 10 | 146 | 3 h 16 m 52.04 s | 2.77× |
FL(Non-IID-imbalanced) | 0.2 | 10 | 129 | 5 h 47 m 53.33 s | 1.57× |
FL(Non-IID-imbalanced) | 0.3 | 10 | 189 | 12 h 44 m 32.78 s | 0.71× |
FL(Non-IID-imbalanced) | 0.1 | 10 | 146 | 3 h 16 m 52.04 s | 2.77× |
FL(Non-IID-imbalanced) | 0.1 | 20 | 124 | 2 h 47 m 12.14 s | 3.27× |
FL(Non-IID-imbalanced) | 0.1 | 30 | 151 | 3 h 23 m 36.56 s | 2.68× |
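
The speed-up column appears to be the baseline training time divided by each scheme's training time. The sketch below recomputes it from the time strings in the table; `to_seconds` and `speed_up` are hypothetical helper names, and the ratio definition is inferred from the numbers rather than stated in the text.

```python
import re


def to_seconds(text: str) -> float:
    """Parse a duration such as '5 h 9 m 32.50 s' into seconds."""
    match = re.fullmatch(r"\s*(\d+) h (\d+) m ([\d.]+) s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speed_up(baseline: str, candidate: str) -> float:
    """Baseline time divided by candidate time; values above 1 mean faster."""
    return to_seconds(baseline) / to_seconds(candidate)


baseline = "9 h 6 m 20.81 s"                    # Baseline (FedSGD)
print(speed_up(baseline, "5 h 9 m 32.50 s"))    # FL(IID), C = 0.1, E = 10
print(speed_up(baseline, "10 h 6 m 46.96 s"))   # FL(IID), C = 0.3, E = 10
```

Ratios computed this way land within about 0.01 of the reported values; the small differences are consistent with rounding in the published table. A value below 1, as for C = 0.3, means the federated run took longer than the baseline.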
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lee, S. Distributed Detection of Malicious Android Apps While Preserving Privacy Using Federated Learning. Sensors 2023, 23, 2198. https://doi.org/10.3390/s23042198