Comparative Study between Big Data Analysis Techniques in Intrusion Detection
Abstract
1. Introduction
2. Literature Review
3. Research Methodology
3.1. Data and Methods
3.1.1. Dataset Description
3.1.2. Apache Spark
3.1.3. Microsoft Azure
3.1.4. Decision Tree Classifier
3.2. Extract, Transform and Prepare
- Exactly-once delivery: each message is delivered exactly once. It is neither lost nor duplicated.
- End-to-end reliability and fault tolerance in Structured Streaming are achieved by specifying a checkpoint directory where all metadata is saved. Even if the entire cluster fails, work can be restarted on a new cluster. Spark supports many data sources, such as file sources, Apache Kafka, sockets and the rate source [25].
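The checkpointing idea in the two points above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: a hypothetical `run_stream` function records the next batch to process in a checkpoint file and writes each batch to a sink keyed by batch id, so a restarted run resumes where the failed one stopped and a replayed batch overwrites its earlier output instead of duplicating it. The names `run_stream`, `offset.json` and `crash_after` are assumptions made for illustration only.

```python
import json
import os
import tempfile


def run_stream(batches, checkpoint_dir, sink, crash_after=None):
    """Process batches in order, resuming from the checkpointed position.

    Toy model only: `sink` is a dict keyed by batch id, so rewriting a
    batch after a failure replaces the earlier write instead of duplicating it.
    """
    offset_file = os.path.join(checkpoint_dir, "offset.json")  # hypothetical file name
    next_batch = 0
    if os.path.exists(offset_file):
        with open(offset_file) as fh:
            next_batch = json.load(fh)["next_batch"]

    for batch_id in range(next_batch, len(batches)):
        sink[batch_id] = list(batches[batch_id])  # idempotent write
        if crash_after is not None and batch_id == crash_after:
            # Simulated failure after the write but before the commit.
            raise RuntimeError("simulated cluster failure")
        with open(offset_file, "w") as fh:
            json.dump({"next_batch": batch_id + 1}, fh)  # commit progress


def demo():
    batches = [["a", "b"], ["c"], ["d", "e"]]
    sink = {}
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        try:
            run_stream(batches, checkpoint_dir, sink, crash_after=1)
        except RuntimeError:
            pass  # the "cluster" failed; restart with the same checkpoint directory
        run_stream(batches, checkpoint_dir, sink)
    # Flatten the sink in batch order.
    return [row for batch_id in sorted(sink) for row in sink[batch_id]]
```

Calling `demo()` replays the batch that was written but not committed before the simulated failure. Because the sink is keyed by batch id, each record still appears once in the result.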
Listing 1. Spark read stream.
# Read CSV files as a stream; FSchema is the schema defined for the input data.
streamingInputDF = (spark.readStream.schema(FSchema)
    .option("header", True).csv(Source_path_CSV))
Listing 2. Spark write stream.
# Write the stream as Parquet files; progress metadata is kept in the checkpoint location.
query = (streamingInputDF.writeStream.format("parquet")
    .option("checkpointLocation", "/checkpoint_location")
    .option("path", "/Parquet_Files_Path/").start())
3.3. Real-Time Classification
Listing 3. Loading Machine Learning (ML) model.
from pyspark.ml import PipelineModel
# Load the saved pipeline and apply it to the streaming DataFrame.
model = PipelineModel.load("Path_to_ML_Model")
dflive = model.transform(LiveData)
Listing 4. Writing the predictions stream.
query = (dflive.select("srcIP", "srcPort", "dstIP", "dstPort", "label", "label_ix", "prediction")
    .writeStream.format("parquet")
    .option("checkpointLocation", "Checkpoint_location_Path")
    .option("path", "Predictions_Path").start())
3.4. Machine Learning Pipeline
- The StringIndexer encodes a string column of labels to a column of label indices. The indices are ordered by label frequencies, so the most frequent label gets index 0.
- The VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train Machine Learning models such as Logistic Regression and Decision Trees.
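What these two stages do to the data can be sketched in plain Python. This is an illustration of the behaviour described above, not the Spark implementation: `fit_label_index` mimics frequency-ordered label indexing and `assemble_features` mimics column assembly. Both function names, the sample rows, the label values and the alphabetical tie-break are assumptions; only the column names `srcPort`, `dstPort`, `label` and `label_ix` come from the listings in this article.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def assemble_features(row, input_cols):
    """Concatenate the selected columns of a row (a dict) into one feature list."""
    return [float(row[col]) for col in input_cols]


# Hypothetical rows; the values are made up for illustration.
rows = [
    {"srcPort": 51234, "dstPort": 80, "label": "anomalous"},
    {"srcPort": 40000, "dstPort": 443, "label": "anomalous"},
    {"srcPort": 53000, "dstPort": 53, "label": "suspicious"},
    {"srcPort": 41000, "dstPort": 80, "label": "notice"},
    {"srcPort": 42000, "dstPort": 6000, "label": "anomalous"},
    {"srcPort": 43000, "dstPort": 80, "label": "notice"},
]

label_index = fit_label_index(row["label"] for row in rows)
prepared = [
    {
        "features": assemble_features(row, ["srcPort", "dstPort"]),
        "label_ix": label_index[row["label"]],
    }
    for row in rows
]
```

In this sample the most frequent label, `anomalous`, receives index 0, and every row ends up with one feature list and one numeric label, which is the shape a classifier such as a Decision Tree expects.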
4. Experimentation Results
4.1. Experiment Setup
4.2. Performance Metrics
- The first factor is the size of the cluster. Although it still varies with the workload, the more processing power we allocate, the more data the system can process.
- The second factor is the incoming rate of data. If it exceeds what the system can process, it creates a bottleneck, and an intervention is needed to limit the input rate.
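The second factor can be made concrete with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: it estimates how many unprocessed rows accumulate when the input rate exceeds the processing rate. The function name `backlog_after` and the constant-rate assumption are illustrative. The two example rates are the input and processing rates reported in the streaming metrics table of this article, and treating them as sustained rates is also an assumption.

```python
def backlog_after(seconds, input_rate, processing_rate):
    """Rows waiting after `seconds`, assuming both rates stay constant."""
    surplus = max(input_rate - processing_rate, 0)
    return surplus * seconds


INPUT_RATE = 555_470       # rows per second
PROCESSING_RATE = 55_175   # rows per second

backlog_one_minute = backlog_after(60, INPUT_RATE, PROCESSING_RATE)
ratio = INPUT_RATE / PROCESSING_RATE
```

Under these assumptions the input arrives roughly ten times faster than it is processed, so the backlog grows steadily until the input rate is limited or more processing power is allocated.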
5. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Top 5 Cybersecurity Concerns for 2018. Available online: https://www.csoonline.com/article/3241766/cyber-attacks-espionage/top-5-cybersecurity-concerns-for-2018.html (accessed on 23 June 2018).
- Cisco Cybersecurity Reports. Available online: https://www.cisco.com/c/en/us/products/security/security-reports.html#~stickynav=2 (accessed on 9 August 2018).
- Myers, S.; Musacchio, J.; Bao, N. Intrusion Detection Systems: A Feature and Capability Analysis; Baskin School of Engineering: Santa Cruz, CA, USA, 2010.
- Stergiou, C.; Psannis, K.E.; Byung-Gyu, K.; Brij, G.B. Secure integration of IoT and Cloud Computing. Future Gener. Comput. Syst. 2018, 78, 964–975.
- Apache Hadoop. Available online: www.apache.com/hadoop (accessed on 8 December 2018).
- Apache Spark. Available online: www.apache.com/spark (accessed on 8 December 2018).
- Ar, L.; Levent, E.; Vipin, K.; Aysel, O.; Jaideep, S. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM Conference on Applications of Dynamical Systems, Snowbird, UT, USA, 27–31 May 2003.
- Massimiliano, A.; Erbacher, R.F.; Jajodia, S.; Persia, M.C.F.; Picariello, A.; Sperli, G.; Subrahmanian, S.V. Recognizing unexplained behavior in network traffic. Netw. Sci. Cybersecur. 2013, 55, 39–62.
- Manzoor, M.A.; Morgan, Y. Real-time Support Vector Machine based Network Intrusion Detection system using Apache Storm. In Proceedings of the IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 13–15 October 2016.
- Belouch, M.; el Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Comput. Sci. 2018, 127, 1–6.
- Pallaprolu, S.C.; Sankineni, R.; Thevar, M.; Karabatis, G.; Wang, J. Zero-Day Attack Identification in Streaming Data Using Semantics and Spark. In Proceedings of the IEEE International Congress on Big Data (BigData Congress), Honolulu, HI, USA, 25–30 June 2017.
- Gupta, G.P.; Kulariya, M. A Framework for Fast and Efficient Cyber Security Network Intrusion Detection Using Apache Spark. Procedia Comput. Sci. 2016, 93, 824–831.
- Terzi, D.S.; Terzi, R.; Sagiroglu, S. Big data analytics for network anomaly detection from netflow data. In Proceedings of the International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–8 October 2017.
- Cisco Systems NetFlow Services Export Version 9. Available online: https://tools.ietf.org/html/rfc3954 (accessed on 2 June 2018).
- Casas, P.; Soro, F.; Vanerio, J.; Settanni, G.; D'Alconzo, A. Network security and anomaly detection with Big-DAMA, a big data analytics framework. In Proceedings of the IEEE 6th International Conference on Cloud Networking (CloudNet), Prague, Czech Republic, 25–27 September 2017.
- Callegari, C.; Giordano, S.; Pagano, M. Statistical Network Anomaly Detection: An Experimental Study. In Proceedings of the International Conference on Future Network Systems and Security, Paris, France, 23–25 November 2016.
- Fontugne, R.; Borgnat, P.; Abry, P.; Fukuda, K. MAWILab: Combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In Proceedings of the International Conference on emerging Networking EXperiments and Technologies (CoNEXT), Philadelphia, PA, USA, 30 November–3 December 2010.
- Fukuda Lab. Documentation. Available online: http://www.fukuda-lab.org/mawilab/documentation.html (accessed on 8 December 2018).
- Databricks. About Databricks. Available online: https://databricks.com/spark/about (accessed on 6 May 2018).
- Zubair, N. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark; Apress: Berkeley, CA, USA, 2016.
- Armbrust, M.; Das, T.; Torres, J.; Yavuz, B.; Zhu, S.; Xin, R.; Ghodsi, A.; Stoica, I.; Zaharia, M. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In Proceedings of the International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018.
- Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1. Available online: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html (accessed on 6 May 2018).
- Benchmarking Structured Streaming on Databricks Runtime against State-of-the-Art Streaming Systems. Available online: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html (accessed on 7 May 2018).
- Yahoo Streaming Benchmarks. Available online: https://github.com/yahoo/streaming-benchmarks (accessed on 8 December 2018).
- Apache. Structured Streaming Programming Guide. Available online: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (accessed on 7 June 2018).
- Microsoft. Azure Regions. Available online: https://azure.microsoft.com/en-us/global-infrastructure/regions/ (accessed on 5 May 2018).
- What is Microsoft Azure and Why Use It? Available online: https://www.sumologic.com/resource/white-paper/what-is-microsoft-azure-and-why-use-it/ (accessed on 5 May 2018).
- Microsoft. Azure Storage Blobs. Available online: https://azure.microsoft.com/en-us/services/storage/blobs/ (accessed on 8 December 2018).
- Microsoft. Azure SDK for Python. Available online: https://github.com/Azure/azure-sdk-for-python (accessed on 8 December 2018).
- Apache Parquet vs. CSV Files, DZone Database. Available online: https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and (accessed on 6 February 2018).
- Ullah, F.; Babar, M.A. Architectural Tactics for Big Data Cybersecurity Analytic Systems: A Review. arXiv, 2018; arXiv:1802.03178.
- Verma, R.; Kantarcioglu, M.; Marchette, D.; Leiss, E.; Solorio, T. Security Analytics: Essential Data Analytics Knowledge for Cybersecurity Professionals and Students. IEEE Secur. Priv. 2015, 13, 60–65.
- MLlib Evaluation Metrics. Available online: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html (accessed on 3 June 2018).
- Ivanov, T.; Taaffe, J. Exploratory Analysis of Spark Structured Streaming. In Proceedings of the International Conference on Performance Engineering, Berlin, Germany, 9–13 April 2018.
- Gaied, I.; Jemili, F.; Korbaa, O. Intrusion detection based on Neuro-Fuzzy classification. In Proceedings of the 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), Marrakech, Morocco, 17–20 November 2015.
- Essid, M.; Jemili, F. Combining intrusion detection datasets using MapReduce. In Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, Hungary, 9–12 October 2016.
- Li, Z. A Neural Network Based Distributed Intrusion Detection System on Cloud Platform; The University of Toledo: Toledo, OH, USA, 2013.
Code | Anomaly |
---|---|
1 | Sasser worm |
2 | NetBIOS attack |
3 | Remote Procedure Call (RPC) attack |
4 | Server Message Block (SMB) attack |
10 | SYN attack |
11 | Reset (RST) attack |
12 | FIN attack |
20 | Ping flood |
51 | File Transfer Protocol (FTP) attack |
52 | Secure Shell (SSH) attack |
53 | HyperText Transfer Protocol (HTTP) attack |
54 | Hypertext Transfer Protocol Secure (HTTPS) attack |
else | Other |
Code | Anomaly |
---|---|
501 | File Transfer Protocol (FTP) traffic |
502 | Secure Shell (SSH) traffic |
503 | HyperText Transfer Protocol (HTTP) traffic |
504 | Hypertext Transfer Protocol Secure (HTTPS) traffic |
else | Other |
Dataset | Size on Amazon S3 | Query Run Time | Data Scanned | Cost |
---|---|---|---|---|
Data stored as CSV files | 1 TB | 236 s | 1.15 TB | $5.75 |
Data stored in Apache Parquet format | 130 GB | 6.78 s | 2.51 GB | $0.01 |
Savings/Speedup | 87% less using parquet | 34× faster | 99% less data scanned | 99.7% savings |
Average Size (CSV) | Average Size (Parquet) | Speedup |
---|---|---|
9.5 KB | 6.5 KB | ×1.46 |
Specification | Head Node | Worker Node |
---|---|---|
Name | D3 V2 optimized | D4 V2 optimized |
Number | 2 | 2 |
CPU | 4 vCPUs | 8 vCPUs |
Memory (RAM) | 14 GB | 28 GB |
Storage | 200 GB SSD | 400 GB SSD |
Operating System (OS) | Linux (CentOS), 64-bit | Linux (CentOS), 64-bit |
Cost | $0.229/h | $0.458/h |
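Measure | Description | Formula |
---|---|---|
Accuracy | Proportion of all predictions that are correct (TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives) | $\frac{TP + TN}{TP + TN + FP + FN}$ |
Precision | Proportion of instances predicted as positive that are actually positive | $\frac{TP}{TP + FP}$ |
Recall | Proportion of actual positive instances that are predicted as positive | $\frac{TP}{TP + FN}$ |
F-measure | Harmonic mean of Precision and Recall | $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ |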
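The four measures in the table above can be computed directly from confusion-matrix counts. The following is a minimal sketch for the binary case, assuming the standard definitions in terms of true/false positives and negatives. The function name `classification_metrics` is hypothetical, and the counts in the example are made up for illustration; they are not the results of this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F-measure from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

With these counts, accuracy is 0.97 while recall is about 0.82, which shows why accuracy alone can look favourable on imbalanced traffic and why the other three measures are reported alongside it.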
Accuracy | Precision | Recall | F1 Score |
---|---|---|---|
99.95% | 99.91% | 99.95% | 99.93% |
Attack | Multi-Points | HTTP | Network Scan (TCP) | Alpha Flow | DDoS | DoS |
---|---|---|---|---|---|---|
Count | 88,429 | 67,063 | 23,281 | 12,427 | 4162 | 1928 |
Port Number | 0 (Unknown) | 80 | 443 | 53 | 6000 |
---|---|---|---|---|---|
Count | 100,656 | 65,368 | 18,569 | 9531 | 5906 |
Metric | Description | Result |
---|---|---|
Input Rate | Describes how many rows were loaded per second. | 555,470 rows per second |
Processing Rate | Describes how many rows were processed per second. | 55,175 rows per second |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hafsa, M.; Jemili, F. Comparative Study between Big Data Analysis Techniques in Intrusion Detection. Big Data Cogn. Comput. 2019, 3, 1. https://doi.org/10.3390/bdcc3010001