Toward Bulk Synchronous Parallel-Based Machine Learning Techniques for Anomaly Detection in High-Speed Big Data Networks
Abstract
1. Introduction
- We begin with a concise introduction to the changing dynamics of the intrusion detection field.
- We then identify four vital characteristics that should be considered when devising a network anomaly detection system.
- We propose an anomaly detection system based on these principles.
- We perform feature ranking and selection using the information gain (IG) and automated branch-and-bound (ABB) algorithms, respectively.
- We implement logistic regression (LR) and eXtreme gradient boosting (XGBoost) techniques for classifying network traffic.
- We employ an emerging and powerful big data computing framework based on bulk synchronous parallel (BSP) processing.
- We evaluate the proposed system on the ISCX-UNB dataset, which adequately represents network traffic patterns, and highlight the significance of using appropriate datasets in the anomaly detection domain.
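To make the feature-ranking step in the list above concrete, the following is a minimal sketch of ranking discrete features by information gain. It is an illustration under stated assumptions, not the authors' implementation: the function names (`entropy`, `information_gain`, `rank_features`) and the four toy flows are hypothetical, and continuous features would need to be discretized first.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return feature names sorted by decreasing information gain."""
    gains = {name: information_gain([row[i] for row in rows], labels)
             for i, name in enumerate(names)}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical toy flows: (protocolName, direction) with made-up labels.
rows = [("tcp", "L2R"), ("tcp", "L2L"), ("udp", "L2R"), ("udp", "L2L")]
labels = ["attack", "attack", "normal", "normal"]
ranking = rank_features(rows, labels, ["protocolName", "direction"])
```

In this toy data the label is fully determined by the protocol, so `protocolName` ranks ahead of `direction`, whose gain is zero.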
2. Background and Related Work
3. Utilizing Machine Learning and Bulk Synchronous Parallel Computing Techniques
4. Proposed Framework
4.1. Input
4.2. Analysis
4.2.1. Data Preprocessing
4.2.2. Feature Ranking and Selection
Algorithm 1. Automated branch-and-bound feature selection
Input: S: full feature set of training dataset D, with features Xi, where i = 1, 2, 3, …, N
Q: an empty queue; S1, S2: temporary subsets
U: evaluation measure (inconsistency)
Output: a selected feature subset of S
1: initialize L = {S}
2: α = U(S, D)
3: ABB(S, D) /* main function reading all features in data D */
4: do
5:     S1 = S − Xi;
6:     add S1 to Q;
7:     while Q is NOT empty
8:         S2 = delete from Q;
9:         if (S2 is legitimate and U(S2, D) ≤ α)
10:            append S2 to L;
11:            ABB(S2, D)
12: while (i = 1 to N);
13: return the minimum subset in L;
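Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the authors' code. It assumes discrete feature values, uses the inconsistency rate as the measure U, and interprets "legitimate" as "not contained in a subset that was already pruned", which is safe because removing features can never lower the inconsistency rate. The names `inconsistency_rate` and `abb` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, subset):
    """U(S, D): fraction of instances that disagree with the majority class
    among instances sharing the same values on the features in `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in sorted(subset))
        groups[key][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in groups.values())
    return inconsistent / len(rows)


def abb(rows, labels):
    """Return a smallest set of feature indices whose inconsistency rate
    does not exceed that of the full feature set (the bound alpha)."""
    full = frozenset(range(len(rows[0])))
    alpha = inconsistency_rate(rows, labels, full)
    accepted = [full]   # the list L in Algorithm 1
    pruned = []         # subsets that violated the bound
    seen = {full}       # avoids evaluating the same subset twice

    def legitimate(subset):
        # Assumption: anything inside a pruned subset must also violate
        # the bound, so it can be skipped without evaluation.
        return not any(subset <= p for p in pruned)

    def search(current):
        # Branch: drop one feature at a time and queue the results.
        queue = [current - {f} for f in sorted(current)]
        while queue:
            candidate = queue.pop(0)
            if not candidate or candidate in seen:
                continue
            seen.add(candidate)
            if not legitimate(candidate):
                continue
            if inconsistency_rate(rows, labels, candidate) <= alpha:
                accepted.append(candidate)
                search(candidate)          # recurse, as in ABB(S2, D)
            else:
                pruned.append(candidate)   # bound: stop exploring below

    search(full)
    return min(accepted, key=lambda s: (len(s), sorted(s)))


# Hypothetical toy data: the label equals feature 0; feature 2 = f0 XOR f1.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
labels = [0, 0, 1, 1]
selected = abb(rows, labels)
```

On this toy data the search keeps only feature 0, since every other single feature leaves half of the instances inconsistent.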
4.2.3. BSP-based Machine Learning Classifiers
4.2.4. Attack Recognition
4.3. Output
5. Implementation and Evaluation
5.1. Dataset and Experimental Setup
5.2. Performance Evaluation
- Accuracy has traditionally been considered the most important performance evaluation metric. In the proposed system, accuracy was expressed as the percentage of correct IDS predictions among all predictions.
- Detection rate (DR) represents the percentage of correctly classified intrusions or attacks relative to the total number of intrusions.
- False positive rate (FPR) represents the percentage of normal flows incorrectly classified as intrusions relative to the total number of normal flows.
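The three metrics above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions, which match the verbal descriptions in the list; the function name `ids_metrics` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def ids_metrics(tp, tn, fp, fn):
    """Compute accuracy, detection rate and false positive rate (all in %).

    tp: attacks classified as attacks      fn: attacks classified as normal
    tn: normal flows classified as normal  fp: normal flows classified as attacks
    """
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    detection_rate = 100.0 * tp / (tp + fn)
    false_positive_rate = 100.0 * fp / (fp + tn)
    return {
        "accuracy": accuracy,
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts for 1000 flows, 100 of which are attacks.
metrics = ids_metrics(tp=90, tn=895, fp=5, fn=10)
```

With these made-up counts, accuracy is 98.5%, the detection rate is 90%, and the false positive rate is 5 out of 900 normal flows, or roughly 0.56%.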
5.3. Results and Discussion
6. Conclusions
Acknowledgments
Author Contributions
Conflicts of Interest
Dimension | Description | Challenges and Issues |
---|---|---|
Volume | The size of the datasets | |
Velocity | The speed at which the data is being generated | |
Variety | The complexity of data | |
Veracity | The trustworthiness of the data in terms of accuracy | |

Solution | Description | Intrusion Scope | Attack Types |
---|---|---|---|
Firewall | A system designed to stop unauthorized access. | External | IP spoofing, eavesdropping, DoS, port scan, and fragmentation attacks |
Access control | A system that controls or limits unauthorized access. | External | Unauthorized access, password attacks, dictionary attacks, rainbow table attacks, and sniffer attacks |
Cryptography | Techniques for encoding and decoding secret messages. | External | Man-in-the-middle attacks, brute force attacks, and birthday attacks |
IDS | A system that controls and monitors a network or a system. | Internal & External | DoS, DDoS, U2R, port scanning, and flooding |

Serial No. | Feature Rank | Feature Name | Type | Description |
---|---|---|---|---|
1 | f5 | totalSourcePackets | Numeric | Total number of packets transmitted from source to destination |
2 | f4 | totalDestinationPackets | Numeric | Total number of packets transmitted from destination to source |
3 | f9 | direction | Text | Direction of the flow, e.g., L2L, L2R, etc. |
4 | f13 | protocolName | Text | Type of the protocol, e.g., tcp, udp, etc. |
5 | f12 | source | Text | Source IP |
6 | f15 | destination | Text | Destination IP |
7 | f17 | startDateTime | Date | Start timestamp of the connection |
8 | f18 | stopDateTime | Date | Stop timestamp of the connection |

Dataset/Network Flows | LR DR (%) | LR FPR (%) | LR Accuracy (%) | XGBoost DR (%) | XGBoost FPR (%) | XGBoost Accuracy (%) |
---|---|---|---|---|---|---|
ISCX-UNB-Saturday | 98.09 | 0.18 | 99.15 | 99.49 | 0.13 | 99.78 |
ISCX-UNB-Monday | 99.39 | 0.58 | 99.53 | 99.37 | 0.35 | 99.34 |
ISCX-UNB-Tuesday | 98.56 | 0.67 | 98.99 | 98.99 | 0.29 | 99.69 |
ISCX-UNB-Wednesday | 99.11 | 0.45 | 99.26 | 99.48 | 0.58 | 99.68 |
ISCX-UNB-Thursday | 99.23 | 0.39 | 99.44 | 99.69 | 0.16 | 99.79 |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Siddique, K.; Akhtar, Z.; Lee, H.-g.; Kim, W.; Kim, Y. Toward Bulk Synchronous Parallel-Based Machine Learning Techniques for Anomaly Detection in High-Speed Big Data Networks. Symmetry 2017, 9, 197. https://doi.org/10.3390/sym9090197