PDF Malware Detection Based on Fuzzy Unordered Rule Induction Algorithm (FURIA)
Abstract
1. Introduction
1.1. Reason for the Selection of PDF Files
1.2. PDF-Based Malware
- To propose a malware detection model that will protect the systems from any harmful activity caused by PDF malware;
- To compare the findings from the suggested and existing models in use to discover a better and more effective solution for PDF malware detection.
- We propose a FURIA-based model for PDF malware detection;
- We compare the outcomes of the proposed model with those of four well-known ML models: NB, J48, HT, and QDA;
- We run several tests on the dataset available at: http://205.174.165.80/CICDataset/CICEvasivePDFMal2022/Dataset/ (accessed on 5 February 2023);
- We report the experimental results using the MAE, ACC, FM, MCC, precision, and recall metrics.
2. Literature Review
3. Research Methodology
Fuzzy Unordered Rule Induction Algorithm (FURIA)
Algorithm 1: Generation of a single rule r [22].
- Let A be the set of numeric antecedents of r.
- Compute the best fuzzification of A[i] in terms of purity.
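The two steps of Algorithm 1 that survive here (collect the numeric antecedents of r, then pick the fuzzification of each A[i] with the best purity) can be illustrated with a simplified sketch. It is not the authors' implementation. It assumes one-sided trapezoidal antecedents, purity defined as the membership mass of the target class divided by the total membership mass, candidate support bounds drawn from observed feature values, and ties broken in favor of the wider fuzzy set. The names `Antecedent`, `purity`, `best_fuzzification`, and `fuzzify_rule` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Antecedent:
    feature: int    # column index of the numeric attribute
    upper: bool     # True for "x <= core", False for "x >= core"
    core: float     # crisp bound learned by the rule inducer
    support: float  # outer bound where membership drops to 0

    def membership(self, row):
        v = row[self.feature]
        inside = v <= self.core if self.upper else v >= self.core
        if inside or self.support == self.core:
            return 1.0 if inside else 0.0
        # Linear decay from 1 at the core bound to 0 at the support bound.
        mu = (self.support - v) / (self.support - self.core)
        return max(0.0, min(1.0, mu))


def purity(ant, rows, labels, target):
    """Membership mass of the target class over total membership mass."""
    pos = sum(ant.membership(x) for x, y in zip(rows, labels) if y == target)
    tot = sum(ant.membership(x) for x in rows)
    return pos / tot if tot else 0.0


def best_fuzzification(ant, rows, labels, target):
    """Try every observed value outside the core as a support bound."""
    best, best_pur = ant, purity(ant, rows, labels, target)
    values = {x[ant.feature] for x in rows}
    outside = [v for v in values if (v > ant.core if ant.upper else v < ant.core)]
    # Nearest candidates first, so ">=" lets wider fuzzy sets win ties.
    for v in sorted(outside, key=lambda v: abs(v - ant.core)):
        cand = replace(ant, support=v)
        p = purity(cand, rows, labels, target)
        if p >= best_pur:
            best, best_pur = cand, p
    return best, best_pur


def fuzzify_rule(antecedents, rows, labels, target):
    """Greedy loop: repeatedly fix the antecedent with the highest purity."""
    remaining, done = list(antecedents), []
    while remaining:
        scored = []
        for a in remaining:
            others = done + [b for b in remaining if b is not a]
            # Assumption: only rows covered by all other antecedents count.
            keep = [i for i, x in enumerate(rows)
                    if all(b.membership(x) > 0 for b in others)]
            cand, pur = best_fuzzification(
                a, [rows[i] for i in keep], [labels[i] for i in keep], target)
            scored.append((pur, cand, a))
        pur, cand, a = max(scored, key=lambda t: t[0])
        remaining.remove(a)
        done.append(cand)
    return done
```

For a rule with the single antecedent "x <= 2" on the values 1 to 5, where 1 to 3 are malicious and 4 and 5 are benign, this sketch widens the support bound to 4, so the value 3 receives a membership of 0.5 instead of 0.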
4. Results, Analysis, and Discussion
5. Conclusions and Future Direction
Author Contributions
Funding
Conflicts of Interest
References
1. Jeong, Y.S.; Woo, J.; Kang, A.R. Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks. Secur. Commun. Netw. 2019, 2019, 8485365.
2. Cuan, B.; Damien, A.; Delaplace, C.; Valois, M. Malware detection in PDF files using machine learning. In Proceedings of ICETE 2018, the 15th International Joint Conference on e-Business and Telecommunications, Warangal, India, 18–21 December 2018; Volume 2, pp. 412–419.
3. Falah, A.; Pokhrel, S.R.; Pan, L.; de Souza-Daw, A. Towards enhanced PDF maldocs detection with feature engineering: Design challenges. Multimed. Tools Appl. 2022, 81, 41103–41130.
4. Docs, A.D. Adobe. Available online: https://opensource.adobe.com/dc-acrobat-sdk-docs/ (accessed on 21 November 2022).
5. Zhang, J. MLPdf: An Effective Machine Learning Based Approach for PDF Malware Detection. arXiv 2018, arXiv:1808.06991.
6. Malware Analysis on PDF. Available online: https://scholarworks.sjsu.edu/etd_projects/683/ (accessed on 20 May 2019).
7. Xu, W.; Qi, Y.; Evans, D. Automatically Evading Classifiers. In Proceedings of the 23rd Annual Network and Distributed System Security Symposium (NDSS ’16), San Diego, CA, USA, 21–24 February 2016; Volume 2016, pp. 21–24.
8. Chakkaravarthy, S.S.; Sangeetha, D.; Vaidehi, V. A Survey on malware analysis and mitigation techniques. Comput. Sci. Rev. 2019, 32, 1–23.
9. Li, W.; Meng, W.; Tan, Z.; Xiang, Y. Design of multi-view based email classification for IoT systems via semi-supervised learning. J. Netw. Comput. Appl. 2019, 128, 56–63.
10. Li, Y.; Wang, X.; Shi, Z.; Zhang, R.; Xue, J.; Wang, Z. Boosting training for PDF malware classifier via active learning. Int. J. Intell. Syst. 2022, 37, 2803–2821.
11. Kang, A.R.; Jeong, Y.-S.; Kim, S.L.; Woo, J. Malicious PDF detection model against adversarial attack built from benign PDF containing javascript. Appl. Sci. 2019, 9, 4764.
12. Chen, Y.; Wang, S.; She, D.; Jana, S. On training robust PDF malware classifiers. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), Boston, MA, USA, 12–14 August 2020; pp. 2343–2360.
13. Cova, M.; Kruegel, C.; Vigna, G. Detection and analysis of drive-by-download attacks and malicious JavaScript code. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26 April 2010; pp. 281–290.
14. Laskov, P.; Šrndić, N. Static detection of malicious JavaScript-bearing PDF documents. In Proceedings of the 27th Annual Computer Security Applications Conference, Orlando, FL, USA, 5–9 December 2011; pp. 373–382.
15. Ryan, C. Automatic Re-Engineering of Software Using Genetic Programming; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2000.
16. Khitan, S.J.; Hadi, A.; Atoum, J. PDF forensic analysis system using YARA. Int. J. Comput. Sci. Netw. Secur. 2017, 17, 77–85.
17. Liu, D.; Wang, H.; Stavrou, A. Detecting malicious javascript in pdf through document instrumentation. In Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Atlanta, GA, USA, 23–26 June 2014; pp. 100–111.
18. Smutz, C.; Stavrou, A. Malicious PDF detection using metadata and structural features. In Proceedings of the 28th Annual Computer Security Applications Conference, Orlando, FL, USA, 7 December 2012; pp. 239–248.
19. Xu, M.; Kim, T. PlatPal: Detecting Malicious Documents with Platform Diversity. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, USA, 16–18 August 2017; pp. 271–287.
20. Li, M.; Liu, Y.; Yu, M.; Li, G.; Wang, Y.; Liu, C. FEPDF: A robust feature extractor for malicious PDF detection. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, Australia, 1–4 August 2017; pp. 218–224.
21. Scofield, D.; Miles, C.; Kuhn, S. Fast model learning for the detection of malicious digital documents. In Proceedings of the 7th Software Security, Protection, and Reverse Engineering/Software Security and Protection Workshop, San Juan, Puerto Rico, 4–5 December 2017; pp. 1–8.
22. Hühn, J.; Hüllermeier, E. FURIA: An algorithm for unordered fuzzy rule induction. Data Min. Knowl. Discov. 2009, 19, 293–319.
23. Naseem, R.; Khan, B.; Ahmad, A.; Almogren, A.; Jabeen, S.; Hayat, B.; Shah, M.A. Investigating Tree Family Machine Learning Techniques for a Predictive System to Unveil Software Defects. Complexity 2020, 2020, 6688075.
24. Khan, B.; Naseem, R.; Shah, M.A.; Wakil, K.; Khan, A.; Uddin, M.I.; Mahmoud, M. Software Defect Prediction for Healthcare Big Data: An Empirical Evaluation of Machine Learning Techniques. J. Healthc. Eng. 2021, 2021, 8899263.
25. Gasparovica, M.; Aleksejeva, L. Using Fuzzy Unordered Rule Induction Algorithm for cancer data classification. Breast Cancer 2011, 13, 1229.
26. Soares, E.; Damascena, L.; Lima, L.M.; Moraes, R.M.D. Analysis of the Fuzzy Unordered Rule Induction Algorithm as a Method for Classification. In Proceedings of the V Congresso Brasileiro de Sistemas Fuzzy, Fortaleza, Brazil, 4–6 July 2018; pp. 4–6.
27. Verma, L.; Srivastava, S.; Negi, P.C. Transactional Processing Systems A Hybrid Data Mining Model to Predict Coronary Artery Disease Cases Using Non-Invasive Clinical Data. J. Med. Syst. 2016, 40, 178.
28. Ukanova, Z.M.; Udun, K.G.; Lemessova, Z.E.; Hamkhash, L.K.; Alchenko, E.R.; Ukasov, R.B. Detection of Paracetamol in Water and Urea in Artificial Urine with Gold Nanoparticle @Al Foil Cost-efficient SERS Substrate. Anal. Sci. 2018, 34, 183–187.
S No. | Feature | Description |
---|---|---|
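1 | Obj | The number of “obj” keywords (object openings) in the PDF; an unusual count might be a sign of an attempt to obfuscate. |
2 | endobj | The number of “endobj” keywords (object terminations). PDFs also support many other forms of obfuscation, including string obfuscation in hex, octal, etc., that are typically used in evasion efforts. |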
3 | Stream | The number of binary data sequences (streams) in the PDF. Streams matter because they may contain dangerous code. |
4 | Endstream | The number of keywords that signify a stream’s termination. |
5 | Xref | The number of cross-reference (Xref) tables in the PDF. |
6 | Trailer | The number of trailers in the PDF. |
7 | Startxref | The number of “startxref” keywords, which mark the location where the Xref table begins. |
8 | Pageno | The number of pages. Because malicious PDF files do not care how their content is presented, they often contain fewer pages, often only one blank page. |
9 | Encrypt | Indicates whether the PDF file is password-protected. |
10 | Objstm | The number of object streams, i.e., streams that contain other objects. |
11 | JS | The number of objects containing JavaScript. |
12 | Javascript | The number of objects that include JavaScript code, the most commonly abused feature. |
13 | AA | Specifies a particular response to an event. |
14 | OpenAction | Defines a specific action to be taken when the PDF file is opened. Most common malicious PDF files have been found to use this functionality in conjunction with JavaScript. |
15 | Acroform | Acrobat forms are PDF form fields that offer scripting technology, which may be abused by attackers. |
16 | JBIG2Decode | JBIG2Decode is a filter commonly used to encode malicious content. The number of objects with nested filters is also relevant: nested filters make decoding more challenging and may be an indicator of evasion. |
17 | RichMedia | The number of RichMedia keywords indicates the quantity of Flash files and embedded media. |
18 | Launch | The Launch action can be used to run a command or program. |
19 | EmbeddedFile | PDFs can attach or embed a variety of items that may be exploited, such as additional PDF files, Word documents, pictures, etc. |
20 | XFA | Certain PDF files contain XFA (XML Forms Architecture) forms, which offer scripting technologies that can be abused by attackers. |
21 | Color | The number of different colors used in the PDF. |
22 | Class | The class label: malicious or benign. |
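Most of the features in the table above read as keyword counts over the raw bytes of a PDF file. The following is a minimal sketch of that idea, not the extraction pipeline used to build the dataset. The keyword spellings in `STRUCTURAL` and `NAMES` and the function `count_keywords` are assumptions made for illustration, and the sketch does not handle obfuscated names (for example, hex-escaped characters), which the table itself notes are used for evasion.

```python
import re

# Hypothetical keyword lists modelled on the feature names in the table.
STRUCTURAL = [b"obj", b"endobj", b"stream", b"endstream",
              b"xref", b"trailer", b"startxref"]
NAMES = [b"/Page", b"/Encrypt", b"/ObjStm", b"/JS", b"/JavaScript", b"/AA",
         b"/OpenAction", b"/AcroForm", b"/JBIG2Decode", b"/RichMedia",
         b"/Launch", b"/EmbeddedFile", b"/XFA"]


def _pattern(keyword):
    # Structural keywords must not follow a letter, so "obj" is not counted
    # inside "endobj" and "xref" is not counted inside "startxref".
    prefix = b"" if keyword.startswith(b"/") else rb"(?<![A-Za-z])"
    # No keyword may be followed by a letter, so "/Page" skips "/Pages".
    return re.compile(prefix + re.escape(keyword) + rb"(?![A-Za-z])")


def count_keywords(data):
    """Return a {keyword: occurrence count} dict for raw PDF bytes."""
    return {kw.decode("ascii"): len(_pattern(kw).findall(data))
            for kw in STRUCTURAL + NAMES}


if __name__ == "__main__":
    sample = (b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
              b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
              b"xref\ntrailer\nstartxref\n0\n%%EOF")
    for name, count in count_keywords(sample).items():
        print(name, count)
```

Counting on raw bytes keeps the sketch independent of any PDF parser, which matters for malformed files, at the cost of also counting keywords that appear inside stream data.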
Models | Actual Class | Predicted: No | Predicted: Yes |
---|---|---|---|
FURIA | no | 8995 | 11 |
FURIA | yes | 27 | 10,953 |
NB | no | 8807 | 199 |
NB | yes | 97 | 10,883 |
J48 | no | 8977 | 29 |
J48 | yes | 33 | 10,947 |
HT | no | 8943 | 63 |
HT | yes | 67 | 10,913 |
QDA | no | 8942 | 64 |
QDA | yes | 82 | 10,898 |
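The ACC, precision, recall, FM, and MCC metrics named in the Introduction can be recomputed from the confusion matrices above. The sketch below assumes that rows are actual classes and columns are predicted classes (the row sums are identical across all five models, which supports this reading) and that “yes” is the positive class. It reports single-class values for “yes”, so the numbers may differ slightly from class-weighted averages. MAE is omitted because it cannot be derived from counts alone. The names `CONFUSION` and `metrics` are illustrative.

```python
from math import sqrt

# (TN, FP, FN, TP) per model, copied from the table above.
# Assumption: "yes" is the positive class, rows are actual classes.
CONFUSION = {
    "FURIA": (8995, 11, 27, 10953),
    "NB": (8807, 199, 97, 10883),
    "J48": (8977, 29, 33, 10947),
    "HT": (8943, 63, 67, 10913),
    "QDA": (8942, 64, 82, 10898),
}


def metrics(tn, fp, fn, tp):
    """Compute standard binary classification metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient for the 2x2 case.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "FM": f_measure,
        "MCC": mcc,
    }


if __name__ == "__main__":
    for model, counts in CONFUSION.items():
        rounded = {k: round(v, 4) for k, v in metrics(*counts).items()}
        print(model, rounded)
```

By raw error counts, FURIA misclassifies 38 of the 19,986 instances, compared with 62 for J48, 130 for HT, 146 for QDA, and 296 for NB.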
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mejjaouli, S.; Guizani, S. PDF Malware Detection Based on Fuzzy Unordered Rule Induction Algorithm (FURIA). Appl. Sci. 2023, 13, 3980. https://doi.org/10.3390/app13063980