FSDC: Flow Samples and Dimensions Compression for Efficient Detection of DNS-over-HTTPS Tunnels
Abstract
1. Introduction
- Our proposed methodology creates unlabeled and labeled versions of a new compressed binary dataset derived from the CIRA-CIC-DoHBrw-2020 dataset.
- We used a star graph and a bar plot to analyze DoH tunnels.
- We modeled DoH tunnels by evaluating four supervised cost-sensitive algorithms (RF, LR, SVM, and XGB) on both the CIRA-CIC-DoHBrw-2020 dataset and the new compressed dataset, and we compared their prediction performance and computation time.
- We modeled DoH tunnels by evaluating four anomaly detection algorithms (OCSVM, iForest, LOF, and MAD) on the unlabeled new compressed dataset, and we conducted further analysis on the selected OCSVM algorithm.
2. Background and Related Works
2.1. Background
2.2. Related Work
3. Materials and Methods
3.1. Datasets
3.1.1. CIRA-CIC-DoHBrw-2020 Dataset
3.1.2. Data Processing
3.1.3. Compressed Dataset
Algorithm 1: Computing time intervals (TI) and connection frequency (CF) between consecutive unique connections.
Input: X: a set of flow records; Ti: the time at which flow record i occurred; Si: source IP address of flow record i; Di: destination IP address of flow record i
Output: TI: time intervals between consecutive connections; CF: connection frequency for each unique source and destination IP pair
Procedure
  Aggregate the flow records in X by unique (Si, Di) pair into groups of unique connections
  for each group do // Sort the flow records within the group
    X′ ← Sort the flow records by their occurrence time Ti in ascending order
    for each sorted group X′ do // For each sorted group of flow records
      Initialize an empty list to store time intervals
      Compute the time interval between consecutive connections
      Add the interval to the list of time intervals
      Count the number of flow records within each sorted group
    end for
  end for
end procedure
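The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the record layout (timestamp, source IP, destination IP), the function name `compute_ti_cf`, and the example flows are assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (timestamp in seconds, source IP, destination IP).
FlowRecord = Tuple[float, str, str]
Pair = Tuple[str, str]


def compute_ti_cf(
    flows: Iterable[FlowRecord],
) -> Tuple[Dict[Pair, List[float]], Dict[Pair, int]]:
    """Return (TI, CF), each keyed by unique (source IP, destination IP) pair."""
    # Aggregate flow timestamps by unique (source, destination) pair.
    groups: Dict[Pair, List[float]] = defaultdict(list)
    for timestamp, src, dst in flows:
        groups[(src, dst)].append(timestamp)

    time_intervals: Dict[Pair, List[float]] = {}
    frequency: Dict[Pair, int] = {}
    for pair, times in groups.items():
        # Sort the flow records of the group by occurrence time.
        ordered = sorted(times)
        # Time interval between each pair of consecutive connections.
        time_intervals[pair] = [b - a for a, b in zip(ordered, ordered[1:])]
        # Number of flow records in the group.
        frequency[pair] = len(ordered)
    return time_intervals, frequency


# Illustrative flows only; the timestamps are made up for this example.
example_flows = [
    (20.0, "192.168.20.144", "1.1.1.1"),
    (0.0, "192.168.20.144", "1.1.1.1"),
    (10.0, "192.168.20.144", "1.1.1.1"),
    (5.0, "192.168.20.191", "8.8.8.8"),
]
ti, cf = compute_ti_cf(example_flows)
```

A pair with a single flow record yields an empty interval list and a frequency of one, so each unique connection is summarized by two values regardless of how many flows it contains.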
3.2. Graph Modeling
3.3. Machine Learning Modeling
3.3.1. Supervised Models
Cost-Sensitive Logistic Regression (CS-LR)
Cost-Sensitive Support Vector Machine (CS-SVM)
Cost-Sensitive Random Forest (CS-RF)
Cost-Sensitive eXtreme Gradient Boosting (CS-XGB)
3.3.2. Anomaly Detection Models
One-Class Support Vector Machine (OCSVM)
Isolation Forest
Local Outlier Factor
Modified Z-Score
4. Experimental Results and Analysis
4.1. Graph Analysis
4.2. Bar Plot Analysis
4.3. Supervised Machine Learning
4.4. Effect of Flow Samples and Dimension Compression
4.5. Anomaly Detection
5. Conclusions
6. Limitations and Recommendations
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Role | IP Addresses |
---|---|
Public DoH IP addresses | 1.1.1.1, 8.8.8.8, 9.9.9.10, 8.8.4.4, 9.9.9.9, 9.9.9.11, 176.103.130.131, 176.103.130.130, 149.112.112.10, 149.112.112.112, 104.16.248.249, 104.16.249.249 |
Source IP used to connect to websites (Google Chrome) | 192.168.20.191 |
Source IPs used to connect to websites (Mozilla Firefox) | 192.168.20.111, 192.168.20.112, 192.168.20.113 |
Source IPs used to create DoH tunnels | 192.168.20.144, 192.168.20.204, 192.168.20.205, 192.168.20.206, 192.168.20.207, 192.168.20.208, 192.168.20.209, 192.168.20.210, 192.168.20.211, 192.168.20.212 |
Category | Feature Name |
---|---|
Flow Direction | F1: Source IP, F2: Destination IP, F3: Source Port, F4: Destination Port. |
Packet Bytes | F5: Duration, F6: Number of flow bytes sent, F7: Rate of flow bytes sent, F8: Number of flow bytes received, F9: Rate of flow bytes received. |
Packet Length | F10: Mean, F11: Median, F12: Mode, F13: Variance, F14: Standard deviation, F15: Coefficient of variation, F16: Skew from median, F17: Skew from mode. |
Packet Time | F18: Mean, F19: Median, F20: Mode, F21: Variance, F22: Standard deviation, F23: Coefficient of variation, F24: Skew from median, F25: Skew from mode. |
Request/response time difference | F26: Mean, F27: Median, F28: Mode, F29: Variance, F30: Standard Deviation, F31: Coefficient of variation, F32: Skew from median, F33: Skew from mode. |
Actual Class | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | C (1, 1) = 1 | C (0, 1) = n/p |
Actual Negative | C (1, 0) = 1 | C (0, 0) = 1 |
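The cost matrix above can be turned into a small worked example. The sketch below assumes that n and p denote the numbers of negative (majority) and positive (minority) samples, which the table does not state explicitly, and that C is indexed as C(predicted, actual), as the row and column labels suggest. The function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

CostMatrix = Dict[Tuple[int, int], float]


def build_cost_matrix(y_true: Sequence[int]) -> CostMatrix:
    """Build C[(predicted, actual)] with the entries listed in the table."""
    p = sum(1 for label in y_true if label == 1)  # assumed: positive count
    n = sum(1 for label in y_true if label == 0)  # assumed: negative count
    if p == 0:
        raise ValueError("at least one positive sample is required")
    return {
        (1, 1): 1.0,    # actual positive, predicted positive
        (0, 1): n / p,  # actual positive, predicted negative
        (1, 0): 1.0,    # actual negative, predicted positive
        (0, 0): 1.0,    # actual negative, predicted negative
    }


def total_cost(y_true: Sequence[int], y_pred: Sequence[int], cost: CostMatrix) -> float:
    """Sum the cost of every prediction under the given cost matrix."""
    return sum(cost[(pred, true)] for true, pred in zip(y_true, y_pred))


# Illustrative labels: four negatives and one positive, so n/p = 4.
labels = [1, 0, 0, 0, 0]
predictions = [0, 0, 0, 0, 1]  # one missed positive, one false alarm
cost_matrix = build_cost_matrix(labels)
cost = total_cost(labels, predictions, cost_matrix)
```

With these labels, the missed positive contributes n/p = 4 to the total, while each of the other four predictions contributes 1, which is how the weighting makes errors on the minority class more expensive as the imbalance grows.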
Original Dataset | Compressed Dataset | |||
---|---|---|---|---|
Dataset | Datasets sizes | ) | ||
Merged | Non-DoH | |||
Benign DoH | ||||
Normal | ||||
Malicious DoH |
Dataset | Method | Precision | Recall | F1 Score | Training Time (s) | Testing Time (s) | # of Predictors |
---|---|---|---|---|---|---|---|
CIRA-CIC-DoHBrw-2020 | LR | 82.367 | 95.399 | 88.86.63 | 18.863 | 0.541 | 33 |
CIRA-CIC-DoHBrw-2020 | SVM | 95.552 | 98.4322 | 96.97 | 4789.607 | 449.549 | 33 |
CIRA-CIC-DoHBrw-2020 | RF | 99.989 | 99.905 | 99.47 | 298.558 | 2.192 | 33 |
CIRA-CIC-DoHBrw-2020 | XGB | 99.993 | 99.998 | 99.995 | 62.681 | 0.111 | 33 |
Recent studies | XTS | 99.99 | 99.96 | 99.94 | 1.8 | 0.07 | 3 |
New compressed | LR | 68.8 | 100 | 81.5 | 0.18 | 0.004 | 2 |
New compressed | SVM | 100 | 90.9 | 95.2 | 0.132 | 0.02 | 2 |
New compressed | RF | 100 | 100 | 100 | 0.44 | 0.04 | 2 |
New compressed | XGB | 50 | 100 | 66.7 | 0.1 | 0.004 | 2 |
Method | Training Speedup (×) | Testing Speedup (×) |
---|---|---|
LR | 104.8 | 135.3 |
SVM | 36,284.9 | 22,477.5 |
RF | 678.5 | 54.8 |
XGB | 626.8 | 27.8 |
XTS * | 18 | 17.5 |
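The LR, SVM, RF, and XGB values in the table above appear to equal the computation time on the CIRA-CIC-DoHBrw-2020 dataset divided by the time on the new compressed dataset. The sketch below reproduces that arithmetic from the training and testing times reported in the results table; the dictionary and function names are introduced here for illustration only.

```python
from typing import Dict, Tuple

Times = Dict[str, Tuple[float, float]]  # method -> (training s, testing s)

# Times copied from the results table for the original dataset.
ORIGINAL: Times = {
    "LR": (18.863, 0.541),
    "SVM": (4789.607, 449.549),
    "RF": (298.558, 2.192),
    "XGB": (62.681, 0.111),
}
# Times copied from the results table for the new compressed dataset.
COMPRESSED: Times = {
    "LR": (0.18, 0.004),
    "SVM": (0.132, 0.02),
    "RF": (0.44, 0.04),
    "XGB": (0.1, 0.004),
}


def speedup(original: Times, compressed: Times) -> Times:
    """Return (training ratio, testing ratio) of original to compressed time."""
    return {
        method: (
            original[method][0] / compressed[method][0],
            original[method][1] / compressed[method][1],
        )
        for method in original
    }


ratios = speedup(ORIGINAL, COMPRESSED)
for method, (train_ratio, test_ratio) in ratios.items():
    print(f"{method}: training x{train_ratio:.2f}, testing x{test_ratio:.2f}")
```

For example, LR training drops from 18.863 s to 0.18 s, a ratio of roughly 104.8, which matches the first row of the table.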
Model | Precision | Recall | F1 Score | Training (ms) | Testing (ms) |
---|---|---|---|---|---|
OCSVM | 88.89 | 100 | 94.12 | 336.1 | 28.9 |
IF | 78.57 | 100 | 88 | 597.4 | 137.6 |
LOF | 88.89 | 100 | 94.12 | 398.9 | 49.8 |
MAD | 0.98 | 94.12 | 1.95 | N/A | N/A |
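As a quick check on how the F1 scores in the table relate to precision and recall, the sketch below applies the standard definition of F1 as the harmonic mean of the two. The precision and recall values are copied from the table; the function name is an assumption made for this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the anomaly detection table.
reported = {
    "OCSVM": (88.89, 100.0),
    "IF": (78.57, 100.0),
    "LOF": (88.89, 100.0),
}
f1_values = {model: f1_score(p, r) for model, (p, r) in reported.items()}
for model, value in f1_values.items():
    print(f"{model}: F1 = {value:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector with perfect recall but modest precision still receives an F1 score well below 100.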
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mungwarakarama, I.; Wang, Y.; Hei, X.; Song, X.; Nyesheja, E.M.; Turiho, J.C. FSDC: Flow Samples and Dimensions Compression for Efficient Detection of DNS-over-HTTPS Tunnels. Electronics 2024, 13, 2604. https://doi.org/10.3390/electronics13132604