Spyware Identification for Android Systems Using Fine Trees
Abstract
:1. Introduction
2. Related Work
3. Identification, Modeling, and Evaluation
3.1. The Identification Model
- The Dataset: The used dataset (Spyware-Android 2022) [20] includes network traffic data for the most advanced spyware tools used for android, including MobileSPY and FlexSPY, in addition to the normal traffic samples. The dataset comprises seven features (sequence number, duration time, source address, destination address, target protocol type, traffic length, and additional information about the traffic behavior) and one target class (normal, MobileSPY, and FlexSPY). This dataset focuses on spyware systems that share a similar installation process, which was followed according to the instructions provided by the manufacturers. The data collection process involves both the spyware package information and transaction data. A crucial aspect of preparing the dataset for this research was creating a high-quality benchmark, which was accomplished by evaluating the dataset based on established criteria. The benchmark must be objectively interpreted, comparable, and repeatable to ensure validity. The usefulness of benchmarks is maximized when they accurately reflect real-world scenarios. Then the machine learning model was trained using this dataset. So, its accuracy is, in fact, its real-world performance matrix. Furthermore, this research is very novel and focused on detecting spyware for android. The existing security systems are very concerned about spyware, specifically its detection, let alone android spyware; this model would feature detecting android spyware using the random forest machine learning model with very reasonable accuracy. Such a model has never been proposed in the literature. This reach adds a commendable dataset and a model for accurately classifying android spyware to the existing security systems. The dataset used for this experiment was acquired from the most common spyware applications that are available commercially. All the spyware’s features were activated. The dataset was recorded using a packet sniffer tool that operated on android. The tool used was PCAPDroid. The dataset contains data in CSV as well as PCAP format. The data has three classes: class A has the normal traffic data; class B has the instances of spyware installation traffic; and class C has the typical spyware traffic.
- The preprocessing stage: In this stage, we have checked the validity of all data samples, fixed all errors in the data records, encoded all categorical data, compensated all null values with zeros, and removed all duplications. We have also integrated all samples from the dataset for each class into one common file in a randomized manner in order to be trained at the next stages of the model.
- The training module: In this module, in order to build up a comparative study, we have constructed the training model using three different machine learning methods, including fine decision trees (FDT), support vector machines (SVM) [30], and the naïve Bayes classifier (NBC). The three models have been established and trained using 75% of the samples in the overall dataset. The remaining 25% of the samples have been used to test (validate) the model’s predictability for unseen data samples. Also, 5-fold cross-validation has been used at the validation stage to ensure an efficient validation procedure [31].
- The evaluation process is conducted to measure the model’s performance in spyware identification for android systems using the different training models (FDT, SVM, and NBC) in terms of accuracy, precision, and sensitivity. This will end up with comparative results between the three machine learning models, of which one is selected as the best-performing model.
3.2. The System Evaluation
4. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Grimmelmann, J. Spyware vs. Spyware: Software Conflicts and User Autonomy. Ohio St. Tech. LJ. 2020, 16, 25. [Google Scholar]
- Albulayhi, K.; Smadi, A.A.; Sheldon, F.T.; Abercrombie, R.K. IoT Intrusion Detection Taxonomy, Reference Architecture, and Analyses. Sensors 2021, 21, 6432. [Google Scholar] [CrossRef] [PubMed]
- Smadi, A.A.; Ajao, B.T.; Johnson, B.K.; Lei, H.; Chakhchoukh, Y.; Abu Al-Haija, Q. A Comprehensive Survey on Cyber-Physical Smart Grid Testbed Architectures: Requirements and Challenges. Electronics 2021, 10, 1043. [Google Scholar] [CrossRef]
- Kaspersky. Spyware Definition. 2022. Available online: https://www.kaspersky.com/resource-center/threats/spyware (accessed on 14 January 2023).
- Gilski, P.; Stefanski, J. Android os: A review. Tem J. 2015, 4, 116. [Google Scholar]
- Yadav, C.S.; Singh, J.; Yadav, A.; Pattanayak, H.S.; Kumar, R.; Khan, A.A.; Haq, M.A.; Alhussen, A.; Alharby, S. Malware Analysis in IoT & Android Systems with Defensive Mechanism. Electronics 2022, 11, 2354. [Google Scholar] [CrossRef]
- Android Authority. Opportunity Powered by Choice. 2020. Available online: https://www.android.com/everyone/enabling-opportunity/ (accessed on 14 January 2023).
- Al-Haija, Q.A.; Saleh, E.; Alnabhan, M. Detecting Port Scan Attacks Using Logistic Regression. In Proceedings of the 4th International Symposium on Advanced Electrical and Communication Technologies (ISAECT), Alkhobar, Saudi Arabia, 6–8 December 2021; pp. 1–5. [Google Scholar] [CrossRef]
- Laricchia, F. Mobile Operating Systems’ Market Share Worldwide from 1st Quarter 2009 to 4th Quarter 2022. 16 November 2022. Available online: https://www.statista.com/markets/418/topic/481/telecommunications/ (accessed on 14 January 2023).
- Shahzad, R.K.; Haider, S.I.; Lavesson, N. Detection of Spyware by Mining Executable Files. In Proceedings of the 2010 International Conference on Availability, Reliability and Security, Krakow, Poland, 15–18 February 2010; pp. 295–302. [Google Scholar] [CrossRef]
- Abu Al-Haija, Q.; Al-Saraireh, J. Asymmetric Identification Model for Human-Robot Contacts via Supervised Learning. Symmetry 2022, 14, 591. [Google Scholar] [CrossRef]
- Shatnawi, A.S.; Jaradat, A.; Yaseen, T.B.; Taqieddin, E.; Al-Ayyoub, M.; Mustafa, D. An Android Malware Detection Leveraging Machine Learning. Wirel. Commun. Mob. Comput. 2022, 2022, 1830201. [Google Scholar] [CrossRef]
- Liu, K.; Xu, S.; Xu, G.; Zhang, M.; Sun, D.; Liu, H. A Review of Android Malware Detection Approaches Based on Machine Learning. IEEE Access 2020, 8, 124579–124607. [Google Scholar] [CrossRef]
- Zulkifli, A.; Hamid, I.R.A.; Shah, W.M.; Abdullah, Z. Android Malware Detection Based on Network Traffic Using Decision Tree Algorithm. In Recent Advances on Soft Computing and Data Mining, Proceedings of the Third International Conference on Soft Computing and Data Mining (SCDM 2018), Johor, Malaysia, 6–7 February 2018; Ghazali, R., Deris, M., Nawi, N., Abawajy, J., Eds.; Springer: Cham, Switzerland, 2018; Volume 700. [Google Scholar] [CrossRef]
- Abu Al-Haija, Q.; Odeh, A.; Qattous, H. PDF Malware Detection Based on Optimizable Decision Trees. Electronics 2022, 11, 3142. [Google Scholar] [CrossRef]
- Amro, A.; Al-Akhras, M.; Hindi, K.E.; Habib, M.; Shawar, B.A. Instance reduction for avoiding overfitting in decision trees. J. Intell. Syst. 2021, 30, 438–459. [Google Scholar] [CrossRef]
- Mohamed, W.; Salleh, M.; Omar, A. A comparative study of Reduced Error Pruning method in decision tree algorithms. In Proceedings of the IEEE International Conference on Control System, Computing and Engineering, Penang, Malaysia, 23–25 November 2012; pp. 392–397. [Google Scholar] [CrossRef]
- Pierazzi, F.; Mezzour, G.; Han, Q.; Colajanni, M.; Subrahmanian, V.S. A data-driven characterization of modern Android spyware. ACM Trans. Manag. Inf. Syst. 2020, 11, 1–38. [Google Scholar] [CrossRef]
- Gana, N.N.; Abdulhamid, S.M.; Misra, S.; Garg, L.; Ayeni, F.; Azeta, A. Optimization of Support Vector Machine for Classification of Spyware Using Symbiotic Organism Search for Features Selection. In Information Systems and Management Science. ISMS 2020; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2022; Volume 303. [Google Scholar] [CrossRef]
- Qabalin, M.K.; Naser, M.; Alkasassbeh, M. Android Spyware Detection Using Machine Learning: A Novel Dataset. Sensors 2022, 22, 5765. [Google Scholar] [CrossRef] [PubMed]
- AlJarrah, M.N.; Yaseen, Q.M.; Mustafa, A.M. A Context-Aware Android Malware Detection Approach Using Machine Learning. Information 2022, 13, 563. [Google Scholar] [CrossRef]
- Alshamrani, S. Design and Analysis of Machine Learning Based Technique for Malware Identification and Classification of Portable Document Format Files. Secur. Commun. Netw. 2022, 2022, 7611741. [Google Scholar] [CrossRef]
- Mahesh, S.V. Spyware Detection and Prevention using Deep Learning, A.I. for user applications. Int. J. Recent Technol. Eng. 2019, 7, 345–349. [Google Scholar]
- Abu Al-Haija, Q.; Krichen, M. A Lightweight In-Vehicle Alcohol Detection Using Smart Sensing and Supervised Learning. Computers 2022, 11, 121. [Google Scholar] [CrossRef]
- Lysenko, S.; Bobrovnikova, K.; Popov, P.T.; Kharchenko, V.; Medzatyi, D. Spyware detection technique based on reinforcement learning. In Proceedings of the 1st International Workshop on Intelligent Information Technologies & Systems of Information Security, Khmelnytskyi, Ukraine, 10–12 June 2020; Volume 2623, pp. 307–316. [Google Scholar]
- Anumula, K.; Raymond, J. Adware, and Spyware Detection Using Classification and Association. In Proceedings of the International Conference on Deep Learning, Computing and Intelligence, Chennai, India, 7–8 January 2021; Springer Nature: Singapore, 2022; pp. 355–361. [Google Scholar]
- Fasano, F.; Martinelli, F.; Mercaldo, F.; Nardone, V.; Santone, A. Spyware Detection using Temporal Logic. In Proceedings of the 5th International Conference on Information Systems Security and Privacy, Prague, Czech Republic, 23–25 February 2019; Volume 1, pp. 690–699. [Google Scholar]
- Elmalaki, S.; Ho, B.J.; Alzantot, M.; Shoukry, Y.; Srivastava, M. Spycon: Adaptation based spyware in human-in-the-loop IoT. In Proceedings of the 2019 IEEE Security and Privacy Workshops, San Francisco, CA, USA, 23 May 2019; pp. 163–168. [Google Scholar]
- Suruthi, B.; Yuvasri, G.; Reena, R.; Kapilavani, R.K. Efficient handwritten passwords to overcome spyware attacks. Sci. Technol. 2021, 3, 1–9. [Google Scholar]
- Abu Al-Haija, Q.; Smadi, A.A.; Allehyani, M.F. Meticulously Intelligent Identification System for Smart Grid Network Stability to Optimize Risk Management. Energies 2021, 14, 6935. [Google Scholar] [CrossRef]
- Abu Al-Haija, Q.; Al Badawi, A. High-performance intrusion detection system for networked UAVs via deep learning. Neural Comput. Appl. 2022, 34, 10885–10900. [Google Scholar] [CrossRef]
- Conti, M.; Rigoni, G.; Toffalini, F. ASAINT: A spy App identification system based on network traffic. In Proceedings of the ARES ’20—The 15th International Conference on Availability, Reliability, and Security, Virtual, 25–28 August 2020. [Google Scholar]
- Malik, J.; Kaushal, R. CREDROID: Android malware detection by network traffic analysis. In Proceedings of the PAMCO 2016—2nd MobiHoc International Workshop on Privacy-Aware Mobile Computing, Paderborn, Germany, 5–8 July 2016; pp. 28–36. [Google Scholar]
- Abu Al-Haija, Q.; Al-Dala’ien, M. ELBA-IoT: An Ensemble Learning Model for Botnet Attack Detection in IoT Networks. J. Sens. Actuator Netw. 2022, 11, 18. [Google Scholar] [CrossRef]
- Albulayhi, K.; Abu Al-Haija, Q.; Alsuhibany, S.A.; Jillepalli, A.A.; Ashrafuzzaman, M.; Sheldon, F.T. IoT Intrusion Detection Using Machine Learning with a Novel High Performing Feature Selection Method. Appl. Sci. 2022, 12, 5015. [Google Scholar] [CrossRef]
- Abu Al-Haija, Q.; Zein-Sabatto, S. An Efficient Deep-Learning-Based Detection and Classification System for Cyber-Attacks in IoT Communication Networks. Electronics 2020, 9, 2152. [Google Scholar] [CrossRef]
- Kamran, M.; Rashid, J.; Nisar, M.W. Android fragmentation classification, causes, problems and solutions. Int. J. Comput. Sci. Inf. Secur. 2016, 14, 992. [Google Scholar]
Approach | Advantages | Limitations |
---|---|---|
DL | Can handle large amounts of data and make decisions based on multiple features | Prone to overfitting |
SVM RF | Can handle high-dimensional data and have good generalization ability. Can detect novel variants of spyware | It can be sensitive to the choice of kernel and may require a careful selection of parameters. Accuracy needs to be improved in most cases. |
XGboost D.T. | Can train very fast, saving a lot of time and resources | It does not address the issue of spyware detection specifically |
Static analysis | Can achieve cost accuracy trade-off? | It may not be effective against obfuscated or modified spyware. |
KNN CNN | Can analyze the behavior of an app to identify suspicious activity. Can achieve a very high accuracy | It may not be effective against new or unknown types of spyware. The problem of false positives can raise |
Study | Approach | Dataset | Results |
---|---|---|---|
[18] | DL | 5000 spyware, 5000 goodware, and 5000 other malware (non-spyware) samples | Achieved an accuracy of 98.2% in identifying spyware |
[19] | SVM | 1000 Android apps (500 spyware, 500 benign) | Achieved an accuracy of 97.40% in identifying spyware |
[20] | R.F. | Prepared through packet sniffer, having 386,963 packets captured in 24 files | Achieved an accuracy of 79% and 77% on binary classification and multi-classification, respectively, in identifying spyware |
[21] | D.T. and R.F. | Used CICMalDroid2020 having 16,900 Android samples | Achieved an accuracy of 99.4% on context-aware identification spyware |
[15] | XGboost Gradient Boosted DT | 900 K entries in total (300 K malware, 300 K benign, and 300 K unlabeled) | Achieved an accuracy of 98.5% in identifying spyware |
[11] | incorporated many ML models | 17,394 data points with 279 columns divided into 51 malware families | D.T. achieved an accuracy of 99% in identifying spyware |
[22] | R.F., SVM, and DT | 1200 PDF samples, including malicious and secure files, were divided into 800 and 400 for training and testing. | Achieved an accuracy of 97.8% in identifying malware |
[12] | Static Analysis | The dataset was acquired using Palo Alto networks with malware, benign, and grey ware traffic. | Achieved very high accuracy in identifying spyware |
[23] | DL | The dataset was obtained from Zeltser, and the spyware was in nine families. | Achieved an accuracy of 99.7% in identifying spyware |
Model | Accuracy | Precision | Sensitivity |
---|---|---|---|
FDT | 98.2% | 98.3% | 98.1% |
SVM | 97.7% | 97.5% | 97.3% |
NBC | 93.9% | 94.1% | 91.1% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Naser, M.; Abu Al-Haija, Q. Spyware Identification for Android Systems Using Fine Trees. Information 2023, 14, 102. https://doi.org/10.3390/info14020102
Naser M, Abu Al-Haija Q. Spyware Identification for Android Systems Using Fine Trees. Information. 2023; 14(2):102. https://doi.org/10.3390/info14020102
Chicago/Turabian StyleNaser, Muawya, and Qasem Abu Al-Haija. 2023. "Spyware Identification for Android Systems Using Fine Trees" Information 14, no. 2: 102. https://doi.org/10.3390/info14020102