Article

Enhancing Botnet Detection in Network Security Using Profile Hidden Markov Models

Department of Computer Science, San Jose State University, One Washington Square, San Jose, CA 95192, USA
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(10), 4019; https://doi.org/10.3390/app14104019
Submission received: 25 March 2024 / Revised: 22 April 2024 / Accepted: 7 May 2024 / Published: 9 May 2024

Abstract

A botnet is a network of compromised computer systems, or bots, remotely controlled by an attacker through bot controllers. This covert network poses a threat through large-scale cyber attacks, including phishing, distributed denial of service (DDoS), data theft, and server crashes. Botnets often camouflage their activity by utilizing common internet protocols, such as HTTP and IRC, making their detection challenging. This paper addresses this threat by proposing a method to identify botnets based on distinctive communication patterns between command and control servers and bots. Recognizable traits in botnet behavior, such as coordinated attacks, heartbeat signals, and periodic command distribution, are analyzed. Probabilistic models, specifically Hidden Markov Models (HMMs) and Profile Hidden Markov Models (PHMMs), are employed to learn and identify these activity patterns in network traffic data. This work utilizes publicly available datasets containing a combination of botnet, normal, and background traffic to train and test these models. The comparative analysis reveals that both HMMs and PHMMs are effective in detecting botnets, with PHMMs exhibiting superior accuracy in botnet detection compared to HMMs.

1. Introduction

A botnet is a network infected by malware, where computer systems, known as bots, are remotely managed by an attacker, referred to as the botmaster. The botmaster orchestrates synchronized attacks on a distributed platform, facilitating malicious activities on a large scale. Consequently, botnet attacks pose a significant threat to Internet security, often employed for purposes such as Distributed Denial of Service (DDoS) or personal information theft. Any computer system connected to the internet is susceptible to botnet attacks [1,2].
Communication among bots within the botnet utilizes standard networking protocols, making the detection of botnets challenging. However, specific communication patterns can be identified and utilized for botnet detection. The command and control (C&C) channels serve as the primary means to transmit commands between the bots and the botmaster. Bots are required to connect to the botmaster at regular intervals using the C&C servers, and this periodicity can be leveraged for botnet detection. Additionally, various features can be extracted from network traffic, utilizing flow-level characteristics, such as patterns in the number of packets, or network-level features, like degrees of centrality [3,4].
The primary goal of this work is to propose a Hidden Markov Model (HMM) and a Profile Hidden Markov Model (PHMM) to identify botnet communication using network flow data. Specifically, we address the following challenges:
  • Is accurate detection and classification achievable given a network flow dataset from botnet activities, normal traffic, and background traffic?
  • The volume of botnet traffic in the dataset is significantly lower than that of normal and background traffic, resulting in a class imbalance issue. How can this problem be effectively addressed?
  • How can meaningful features be extracted from network flow data to train HMMs and PHMMs?
This paper is organized as follows. Section 2 provides background details on the topics covered in the project’s implementation. Section 3 explores relevant work related to this research. Section 4 delves into the work’s implementation, dataset, and methodology, covering steps such as data pre-processing, feature extraction, HMM and PHMM training, and evaluation. Section 5 consists of a comparison and analysis of the results obtained from the experiment. Finally, Section 6 presents our conclusions.

2. Background

This section elaborates on the background of the topics used in the implementation of this research. It discusses botnets, autocorrelation analysis, degrees of centrality, Hidden Markov Models, and Profile Hidden Markov Models.

2.1. Botnets

A botnet is defined as a network of infected computers that perform malicious activities, such as information theft, DDoS attacks, and spam distribution. The botnet architecture is shown in Figure 1. It consists of a botmaster, bot clients, C&C communication protocols, and a C&C server [5].
  • Botmaster. The botmaster manages and regulates communication with the bot clients. It controls the botnet army.
  • Bot clients. Bot clients, also called bots, are compromised computers that are added to the botnet and perform malicious activities in synchronization when commanded by the botmaster.
  • C&C communication protocols. These are the protocols the botnet uses to exchange commands between the botmaster and the bots. They can be, for example, Internet Relay Chat (IRC), Peer-to-Peer (P2P), Hypertext Transfer Protocol (HTTP), or hybrid, such as a combination of the HTTP and P2P protocols.
  • C&C servers. The botmaster sends commands through C&C servers, and the bot clients periodically connect to the C&C servers to receive the commands from the botmaster.
The botnet life cycle is shown in Figure 2. It consists of multiple phases, including initial infection, secondary infection, connecting to C&C servers, rallying, malicious commands, and maintaining/updating the malware [5].

2.2. Autocorrelation Analysis

Autocorrelation analysis is used to quantify the relationship between time series data points at distinct instances of time in order to detect trends; that is, it measures how similar a series is to a lagged copy of itself at specific time intervals [3]. The autocorrelation between the current data and the data lagged by k units is calculated as follows:
\rho(k) = \frac{\sum_{t=k+1}^{n} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}
where ρ(k) is the autocorrelation at lag k, x_t is the data value at time t, x̄ is the mean of the data, and n is the total number of data points. Figure 3 shows an example of periodic and non-periodic data. In the figure, each dot represents an autocorrelation value derived from the corresponding sampled data points of the signal; when the underlying signal is periodic, the autocorrelation values themselves exhibit a periodic pattern.
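To make the computation concrete, the following minimal Python sketch implements the formula above directly; the function name and example signal are illustrative and are not taken from the code used in this work.

```python
import numpy as np

def autocorrelation(x, k):
    """Autocorrelation of the series x at lag k, following the formula above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    num = np.sum((x[k:] - mean) * (x[:n - k] - mean))  # sum over t = k+1, ..., n
    den = np.sum((x - mean) ** 2)                       # sum over t = 1, ..., n
    return num / den

# A periodic signal produces autocorrelation values that are themselves periodic:
# peaks appear at lags that are multiples of the underlying period (here, 20 samples).
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 20)
print([round(autocorrelation(signal, k), 2) for k in (0, 5, 10, 15, 20)])
```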

2.3. Degree of Centrality

The degree of centrality quantifies the connectivity of a node with other nodes by measuring the number of edges incident to it; a well-connected node has a higher degree of centrality. It is calculated as follows:
C_D(v) = \frac{\text{number of edges incident to node } v}{\text{total number of nodes} - 1}
where C_D(v) is the degree centrality of node v.
Thus, in the context of network traffic, degree centrality can be used to calculate the importance of the computer system (node) in the network [4].
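For illustration, the snippet below computes degree centrality for a toy communication graph with networkx, whose degree_centrality function normalizes by the total number of nodes minus one, matching the formula above. The IP addresses are placeholders.

```python
import networkx as nx

# Toy communication graph: nodes are hosts (IP addresses), edges are observed flows.
G = nx.Graph()
G.add_edges_from([
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.4"),
    ("10.0.0.2", "10.0.0.3"),
])

centrality = nx.degree_centrality(G)
print(centrality)  # 10.0.0.1 -> 3/3 = 1.0 (well connected); 10.0.0.4 -> 1/3 = 0.33
```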

2.4. Hidden Markov Models

The Hidden Markov Model (HMM) is a statistical model for Markovian systems, in which the future state depends only on the current state, irrespective of earlier states. In contrast to a standard Markov model, the states of an HMM are hidden. The HMM behaves like a state machine in which every hidden state has a probability distribution relating it to the observation symbols. An HMM can be trained using observation sequences as input and can then compute a score for a test sequence, that is, the probability of the occurrence of such a sequence. Table 1 lists the standard notation for HMMs, while Figure 4 depicts the model. Additional details can be found in [6].
HMMs are used to solve the following three standard problems [7]:
Problem 1. Given a model λ = (A, B, π) and an observation sequence O, we can determine the probability P(O | λ). This is to calculate the score of the test observation sequence.
Problem 2. Given a model λ = (A, B, π) and an observation sequence O, we can determine an optimal state sequence for the Markov model by maximizing the expected number of correct states.
Problem 3. Given an observation sequence O and the parameter N, we can determine a model λ that maximizes the probability of O. Thus, we train a model to best fit an observation sequence.
For this work, we use Problem 3 to train the HMM and Problem 1 to classify our test data.
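As a minimal sketch of how Problems 3 and 1 map onto code, the example below trains a Gaussian HMM with the hmmlearn library (the same library used later in Section 4.2) and then scores a test sequence. The random data stand in for the real feature vectors.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
X_train = rng.random((500, 3))   # placeholder: 500 observations with 3 continuous features

# Problem 3: train a model lambda = (A, B, pi) that best fits the observation sequence.
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=1000)
model.fit(X_train)

# Problem 1: score a test observation sequence, i.e., compute log P(O | lambda).
X_test = rng.random((20, 3))
print(model.score(X_test))
```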

2.5. Profile Hidden Markov Models

Profile Hidden Markov Models (PHMMs) are also statistical models and are an extension of HMMs; a PHMM can be viewed as a series of HMMs. Training a PHMM involves several steps, such as creating a multiple sequence alignment (MSA) from the training sequences. Every column in the alignment is used to generate a score corresponding to each position. The PHMM has three kinds of states: match, insert, and delete states. Insert and delete states account for the gaps in the aligned sequences. Transition probabilities are assigned to capture dependencies between adjoining positions [8]. Table 2 lists the standard notation for PHMMs, while Figure 5 depicts the model. Additional details can be found in [8].
In this work, we use the HMMER software package to train the PHMM.
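HMMER is a command-line suite: hmmbuild estimates a profile HMM from a multiple sequence alignment, and hmmsearch scores sequences against the resulting profile. The sketch below shows one plausible way to drive it from Python; the file names and options are illustrative assumptions, not the exact invocation used in this work.

```python
import subprocess

# Build one profile HMM from an aligned FASTA file of training sequences for a
# single botnet family, then score held-out sequences against the profile.
subprocess.run(["hmmbuild", "--amino", "neris.hmm", "neris_train_msa.fasta"], check=True)
subprocess.run(["hmmsearch", "--tblout", "neris_scores.txt", "neris.hmm", "test_sequences.fasta"], check=True)
```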

3. Literature Review and Proposed Research

There are different approaches for detecting botnets that can be categorized into four classes: signature-based, anomaly-based, Domain Name System (DNS)-based, and data mining-based approaches [10].
  • Signature-based detection. In this method, traits in byte sequences of the packet captures are used for botnet identification by finding similarities with known signatures of the botnets. This technique performs well when the signature database is well-updated [11].
  • Anomaly-based detection. These techniques focus on the unusual traits demonstrated by the network traffic. They can be useful to identify new malicious activities that have not been encountered in the past. However, normal activities resulting in unusual spikes can be misclassified [11].
  • DNS-based detection. The DNS server has the records of the IP addresses and the corresponding websites. DNS-based techniques involve analyzing DNS traffic flow. Unusual spikes in the DNS queries can indicate botnet activity [11].
  • Data-mining-based detection. These approaches apply data mining methods to extract meaningful features from the network traffic data. Such features may be derived from statistical analysis of the network flow data or from a study of the network topology. In such methods, a labeled dataset can be used to train machine learning models to perform classification [11].
Among all these techniques, data mining techniques for detecting botnets have several advantages. As botnets perform attacks on a large scale, data mining techniques can be useful to process the large volumes of data generated in the packet captures. When newer types of malicious activities are encountered, they can be easily tracked with these methods as compared to signature-based techniques. As features are extracted, subtle characteristics can be captured by data mining. In addition to this, these methods can help in predicting attacks based on past data [1].
It can be observed that supervised learning techniques produce good results in detecting botnets from network traffic data [1]. Naive Bayes, Random Forest, Artificial Neural Networks, and Support Vector Machines are reported to be effective models [7]. A novel approach of malware clustering based on HMM scores was used to detect malware with good accuracy even for previously unseen samples [3]. Deep learning models were employed with the CTU-13 dataset for classifying botnets, showing promising results [3]. Unsupervised machine learning using clustering algorithms, like k-means, x-means, and EM clustering, was utilized for botnet detection, revealing dissimilarities between botnet flow data and normal data [12]. Real-time botnet analysis was addressed by [13], using machine learning with minimal features. Behavior-based approaches with machine learning algorithms, such as Multilayer Perceptron, k-Nearest Neighbor, and Support Vector Machine, were shown to be effective for botnet detection [14]. Different machine learning models, including Decision Tree, Random Forest, and Adaboost, were used with a combination of undersampling and oversampling for data balancing, yielding positive outcomes [15]. Gaussian Naive Bayes, neural networks, and Decision Trees were employed for botnet detection with the CTU-13 dataset [16]. Ref. [17] focused on detecting social bots using deep learning techniques. Ref. [18] employed an ensemble learning algorithm for botnet detection, combining results from various supervised machine learning algorithms. Smartphone-specific features were considered for botnet detection in internet-connected smartphones, demonstrating a novel approach [19]. Android botnets were targeted using machine learning algorithms like Naive Bayes and Random Forest [20]. Machine learning models were also used to detect botnets in the Internet of Things (IoT), handling data balancing with SMOTE [21].
Thus, using machine learning models to detect and classify botnets from network flow data can achieve good results. Several researchers have used the CTU-13 dataset with machine learning approaches and reported successful outcomes, and some studies have applied SMOTE to the CTU-13 dataset for oversampling to balance the data [21]. Notably, some studies have also explored the use of HMMs and PHMMs in other areas of cybersecurity [22,23]. These probabilistic models capture activity patterns that indicate botnet behavior. However, gaps remain in the literature regarding their specific application to botnet detection. In our proposed work, we aim to bridge those gaps by applying an HMM and a PHMM to botnet detection. Our goal is to contribute novel insights to the field of botnet detection, ultimately improving network security and safeguarding against cyber threats.

4. Methodology and Implementation

We first describe the dataset used in this research: the CTU-13 dataset, a publicly available dataset collected at the Czech Technical University (CTU) in Prague, Czech Republic [24]. It consists of three types of network traffic: botnet, normal, and background. The dataset comprises thirteen capture files, each representing a specific scenario. In our work, we focused on the bidirectional NetFlow data. It is worth noting that the dataset contains a much lower proportion of botnet traffic samples than normal and background traffic. This proportion allowed us to better simulate a realistic scenario in which legitimate traffic is more prevalent.
Table 3 provides details of CTU-13 dataset scenarios, including corresponding botnet types and the protocols used by each type.
The CTU-13 dataset includes 15 features for bidirectional network flow data, as outlined in Table 4. The dataset consists of BINETFLOW-type files, which were used as input for pre-processing.
The objective of this work was to categorize network flow data as either botnet activity or normal activity and, further, to classify botnets into the seven families (Neris, RBot, Virut, DonBot, Sogou, Murlo, and NSIS.ay) present in the CTU-13 dataset. The proposed methodology involved feature extraction and training a Hidden Markov Model (HMM) and a Profile Hidden Markov Model (PHMM) to accurately classify unknown data. The process comprised two main phases: data pre-processing and model training with evaluation. The former involved extracting meaningful features from the 15 available attributes mentioned in Table 4. The latter encompassed training, testing, and evaluating the HMM and PHMM models. Figure 6 provides an overview of the overall implementation process.

4.1. Data Pre-Processing

The data pre-processing phase consists of multiple steps, including filtering network flow, creating a network graph, deriving statistical data from the graph, calculating autocorrelation from edges, determining the degree centrality of the source node, and normalizing the data. Additionally, for the PHMM, a binning step is required to convert data into alphabetical strings, forming sequences in multiple sequence alignment (MSA) format. These sequences are stored in FASTA format for PHMM input. Figure 7 outlines the common data pre-processing steps required for both the HMM and PHMM.
We outline the steps involved in pre-processing our data as follows:
  • Basic data cleaning. Basic data cleaning involves checking for null values and repetitions and sorting the data in ascending order by StartTime.
  • Filter network flow. From the 15 available features mentioned in Table 4, 10 significant features are selected, and unwanted protocols and states are removed based on the communication characteristics associated with botnets. Table 5 displays sample data entries after network flow filtering.
  • Update labels. Botnet labels in the CTU-13 dataset are modified, and unnecessary labels are dropped based on the analysis. The revised labels distinguish between botnet and normal traffic effectively.
  • Create network graph. A network graph is constructed with nodes representing computer systems and edges indicating communication between these systems. The network graph is created using IPs from SrcAddr and DstAddr. The design and algorithm for constructing the network graph are outlined in Figure 8 and Algorithm 1, respectively.
  • Extracting statistical features from network graph. Statistical analysis is performed on features such as Dur, TotPkts, and TotBytes for meaningful feature extraction. The derived statistics include mean, median, standard deviation, minimum, maximum, and range, resulting in 18 new statistical features.
  • Extracting autocorrelation features from network graph. Autocorrelation features are calculated for attributes such as Dur, TotPkts, and TotBytes. Periodicity and strength of autocorrelation are determined, resulting in 9 new autocorrelation features. Figure 9a,b illustrate the periodicity observed in total bytes and duration, respectively.
  • Extracting topology features from network graph. The degree of centrality of the source node is calculated for each edge to quantify the importance of a particular node in the network. Nodes with a high degree of centrality are indicative of botnets.
  • Normalization. MinMax normalization is applied to ensure that all values fall within the range of 0 to 1. This is essential for training the Gaussian HMM used in the experiment and facilitates the additional pre-processing required for the PHMM.
After data pre-processing, a total of 30 features were extracted, as tabulated in Table 6.
Algorithm 1: Pseudo-code for constructing network graph [3].
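Algorithm 1 appears as a figure in the published article; the sketch below is our own approximation of the described procedure (names and details are assumed): one node per IP address, one edge per (SrcAddr, DstAddr) pair, with the time-ordered flow records attached to the edge so that the per-edge statistical and autocorrelation features of Table 6 can be computed in the following steps.

```python
import pandas as pd
import networkx as nx

def build_network_graph(flows: pd.DataFrame) -> nx.DiGraph:
    """Approximate reconstruction of Algorithm 1: build a directed graph from
    the filtered NetFlow records (columns as in Table 4)."""
    G = nx.DiGraph()
    for (src, dst), edge_flows in flows.groupby(["SrcAddr", "DstAddr"]):
        edge_flows = edge_flows.sort_values("StartTime")
        # The per-edge lists of Dur, TotPkts, and TotBytes feed the statistical
        # and autocorrelation feature extraction described above.
        G.add_edge(src, dst,
                   Dur=edge_flows["Dur"].tolist(),
                   TotPkts=edge_flows["TotPkts"].tolist(),
                   TotBytes=edge_flows["TotBytes"].tolist(),
                   Label=edge_flows["Label"].iloc[-1])
    return G
```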

4.2. Hidden Markov Model (HMM)

The HMM was trained using the Gaussian HMM implementation from the hmmlearn library in Python. Two scenarios were considered: binary classification as botnet or normal (N = 2), and multiclass classification into the seven botnet types plus normal traffic (N = 8). To address data imbalance, a combination of oversampling and undersampling was applied during training. The train-to-test ratio was set at 80:20, and a maximum of 1000 iterations was used for HMM training. The model achieved a training accuracy of 95%.
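The exact resampling techniques and ratios are not detailed here, so the snippet below is only an illustrative sketch of combined over- and undersampling for the binary case, using the imbalanced-learn library; X_train and y_train denote the pre-processed feature matrix and labels.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversample the botnet minority to half the size of the normal class, then
# undersample the normal class so the final ratio is roughly 4:5 (illustrative values).
X_over, y_over = RandomOverSampler(sampling_strategy=0.5, random_state=42).fit_resample(X_train, y_train)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.8, random_state=42).fit_resample(X_over, y_over)
```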
Here, we outline a step-by-step recap:
  • The dataset was divided into training and testing data subsets.
  • The training-to-test data split ratio was set at 80:20.
  • Balancing of the dataset was applied.
  • The HMM was trained with a maximum of 1000 iterations.
  • The HMM was applied to classify data as either botnet or normal.
  • The HMM was applied to classify seven malware families plus normal traffic.

4.3. Profile Hidden Markov Model (PHMM)

The data were split into training and testing sets in an 80:20 ratio, maintaining an equal representation of botnet and normal traffic through stratified sampling. Data imbalance was handled using SMOTE for oversampling. The PHMM required additional pre-processing steps, including scaling the data to the range 0 to 1, converting the data into uppercase alphabetical strings through binning, and forming sequences for input. The sequences were stored in FASTA format for PHMM input. Testing involved calculating a score for each trained PHMM and classifying based on the maximum score. Sequences that did not match any of the seven PHMM models were classified as normal traffic.
The test data consisted of 2231 normal traffic sequences and 2852 botnet samples. The PHMM, being a more complex model than the HMM, required more data for training and achieved an accuracy of 97%.
The FASTA files served as input to the PHMM model, which utilized the HMMER library for training seven separate PHMMs, one for each botnet type.
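The following sketch illustrates the binning and FASTA conversion described above, assuming the normalized features are held in a pandas DataFrame and using 25 equal-width bins mapped to the letters A–Y (as detailed in Section 5); the per-column bin edges used in the actual experiments may differ.

```python
import string
import numpy as np
import pandas as pd

LETTERS = string.ascii_uppercase[:25]  # 25 bins -> letters A..Y

def to_sequence(values: np.ndarray) -> str:
    """Map normalized feature values in [0, 1] to letters and concatenate them."""
    bins = np.clip((values * 25).astype(int), 0, 24)
    return "".join(LETTERS[b] for b in bins)

def write_fasta(features: pd.DataFrame, path: str) -> None:
    """Write one FASTA record per sample: a 'SeqN' header followed by its letter sequence."""
    with open(path, "w") as fasta:
        for i, (_, row) in enumerate(features.iterrows()):
            fasta.write(f">Seq{i}\n{to_sequence(row.to_numpy())}\n")
```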
Here, we outline a step-by-step recap:
  • The dataset was divided into training and test data subsets.
  • The training-to-test data split ratio was set at 80:20.
  • Balancing of the dataset was applied.
  • The original pre-processed data were scaled to a range of 0 to 1.
  • Data were converted to uppercase alphabetical strings (A–Z) using a process called binning.
  • These features were concatenated to form sequences.
  • The sequences were then converted into FASTA format files.
  • One separate PHMM was trained for each of the seven distinct botnet types.
  • Scores were calculated for each trained PHMM.
  • Classification was performed based on the maximum score.
  • Sequences that did not match any of the seven PHMM models were classified as normal traffic.
  • The PHMM was applied to classify data as either botnet or normal.
  • The PHMM was applied to classify seven malware families plus normal traffic.

5. Results

The solution proposed was implemented using two models: Hidden Markov Model (HMM) and Profile Hidden Markov Model (PHMM). For the HMM, the Python library ‘hmmlearn’ was utilized, while the PHMM implementation made use of the HMMER tool. The performance of these models was evaluated using various metrics such as training and testing accuracies, precision, recall, and F1 score. The results of these evaluations are presented in Table 7, Table 8 and Table 9.
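For reference, these metrics follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}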
Table 7 presents the model evaluation for N = 8. The highest accuracy was achieved for Donbot, followed by Neris. This could be attributed to the larger number of botnet data samples present in the dataset. The lowest accuracy was observed for the detection of normal traffic. Figure 10a displays the confusion matrix for multiclass classification for HMM. The test data consisted of 835 normal traffic samples and 837 botnet samples. Although the botnets were classified effectively, the classification of normal traffic was not as successful. This could be due to the greater diversity and variability in normal network traffic, which is influenced by factors such as different protocols or user behavior. In contrast, botnet traffic tended to exhibit similar patterns. Table 8 presents the model evaluation for N = 2, and Figure 10c shows the corresponding confusion matrix.
For the PHMM implementation, the normalized data were binned to correspond with uppercase English alphabetical characters, using a total of 25 bins to match the first 25 letters of the alphabet. Bin edges were defined for each column in the dataset, and the training and testing datasets were then binned. The data in the binned columns were concatenated to form the sequences, which were converted into the FASTA format required for HMMER input, with each sequence assigned a unique identifier, such as ‘Seq0’, as its header line. These sequences were input to HMMER, with the true labels stored in separate files for the training and test sequences. HMMER produced a text file as output, which was parsed into Python dataframes. The score in the result file, representing the likelihood of a sequence belonging to a class, was used to determine the class: each test sequence was scored against all seven PHMMs, and the class with the maximum score was selected. Sequences that did not produce a match for any of the seven PHMMs trained on the seven botnets were classified as normal. The predicted labels were compared against the true labels to evaluate the model. Table 9 presents the results obtained for the PHMM. The highest accuracy was observed for NSIS.ay, and the lowest for normal network traffic. As with the HMM, this might be due to the greater diversity and variability of normal network traffic. Figure 10b displays the confusion matrix for multiclass classification with the PHMM. The test data consisted of 2231 normal traffic sequences and 2852 botnet samples.
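A minimal sketch of the classification rule just described, with illustrative names and score values: each test sequence is assigned the family whose PHMM produced the highest score, and sequences with no match against any of the seven PHMMs are labeled as normal.

```python
def classify_sequence(scores_by_family: dict) -> str:
    """Pick the botnet family with the maximum PHMM score; no match means normal traffic."""
    if not scores_by_family:
        return "Normal"
    return max(scores_by_family, key=scores_by_family.get)

print(classify_sequence({"Neris": 12.4, "Rbot": 3.1, "Virut": 7.9}))  # -> "Neris"
print(classify_sequence({}))                                          # -> "Normal"
```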
From Table 10, we can observe the average training and testing accuracies for both the HMM and the PHMM. The PHMM performed better in terms of accuracy, and Table 9 shows that the accuracy for classifying normal traffic was also higher for the PHMM. This could be due to the greater complexity of the PHMM compared to the HMM, enabling it to perform better. Although the accuracies of the two models are relatively close, Table 11 shows that the F1 score of the HMM is lower than that of the PHMM. This is because the precision of the HMM is lower: normal traffic was more often misclassified as botnet traffic, since normal network traffic exhibits greater diversity and variability than botnet traffic.
The PHMM model performed better than the HMM model, considering the different evaluation metrics. This might be because the PHMM could register subtle details in a sequence alignment, such as position-specific details and alignment characteristics. HMMs do not use positional details in observation sequences. However, the execution of HMMs is faster than that of PHMMs.

5.1. Comparison with Other Models

The multiclass nature of our experiments makes botnet detection particularly challenging. In Table 12, we compare our HMM and PHMM approaches to other detection models. In terms of accuracy, the results obtained with HMM are comparable to k-Nearest Neighbors (kNN) and only inferior to Multilayer Perceptron (MLP). The PHMM, on the other hand, surpasses all other models except for MLP, to which it has a close accuracy (0.97 compared to 0.99 for MLP). When considering the F1 score, we observe the effect of class imbalance reducing the overall detection rate. The F1 score for the HMM is reduced to 0.78, which is comparable to Decision Trees, superior to Support Vector Machines (SVMs) and Naive Bayes, but inferior to kNN and MLP. The PHMM, however, has a score of 0.89, which is superior to all methods but still inferior to MLP.
We can conclude that the PHMM, in particular, is a method that closely matches the performance we can achieve with MLP. The advantage in this case is its faster training time and the ability to easily update its model compared to retraining an MLP model. In fact, the original model can be updated with new sequences by simply continuing the learning process from where we left off.

5.2. Theoretical, Practical, Societal, and Methodological Implications

Our study contributes to the theoretical understanding of botnet detection by applying an HMM and a PHMM. These probabilistic models effectively capture intricate patterns within network traffic data, shedding light on the dynamics of botnet behavior. The comparative analysis between the HMM and PHMM provides valuable insights into their relative complexities and capabilities. Moreover, the successful implementation of these models demonstrates their practical viability for botnet detection. Notably, the achieved accuracy for specific botnet types underscores their practical utility in identifying distinct threats. However, challenges persist in classifying normal traffic, emphasizing the need for ongoing refinement and adaptation to diverse network scenarios.
From a societal perspective, effective botnet detection directly impacts cybersecurity, safeguarding individuals, organizations, and society as a whole from malicious activities. By accurately identifying botnets, we contribute to a safer digital environment. Methodologically, our approach—encompassing data handling, pre-processing, and model selection—provides a blueprint for future research in this field. Additionally, the F1 score comparison between the HMM and PHMM serves as a valuable guide for model selection based on specific requirements. The PHMM’s ability to discern subtle details in sequence alignment further enhances its performance compared to the HMM.
In summary, our study advances theoretical knowledge, offers practical solutions, impacts society’s security, and contributes methodological insights to the critical domain of botnet detection.

6. Conclusions

In this study, we implemented a botnet detection approach using network flow data, leveraging both Hidden Markov Models (HMMs) and Profile Hidden Markov Models (PHMMs). The project involved several stages, including data pre-processing and the extraction of statistical, autocorrelation, and network topology features, followed by model training, testing, and evaluation.
Our results indicated that the PHMM outperformed the HMM, demonstrating superior performance in classifying botnet families, whereas the HMM yielded only average results. The limitations of the HMM may stem from its inherent assumptions (e.g., Markovian behavior) and its simplicity compared to more complex models. It is important to acknowledge that both models have their limitations, and further research can explore hybrid approaches or ensemble methods.
Although the HMM and PHMM were effective, other models, such as neural networks, have been reported to achieve better outcomes in similar studies [3]. Another limitation is the reliance on network flow data alone. Future work can explore additional data sources or incorporate more diverse features to enhance model robustness. We also propose considering additional topological features, such as betweenness centrality, clustering coefficient, and PageRank. Furthermore, other models, like Support Vector Machines (SVMs) and Random Forest, can be explored to potentially achieve improved results. Additionally, advanced techniques beyond SMOTE, such as generative artificial intelligence models that produce synthetic data to balance botnet and normal traffic samples, may prove even more effective.

Author Contributions

Conceptualization, F.D.T.; Software, R.M.; Investigation, F.D.T.; Resources, F.D.T.; Data curation, R.M.; Writing—original draft, R.M.; Writing—review & editing, F.D.T.; Supervision, F.D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this article can be found at https://www.stratosphereips.org/datasets-ctu13/ (accessed on 6 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Stevanovic, M.; Pedersen, J.M. An efficient flow-based botnet detection using supervised machine learning. In Proceedings of the 2014 International Conference on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 3–6 February 2014; pp. 797–801. [Google Scholar]
  2. Saad, S.; Traore, I.; Ghorbani, A.; Sayed, B.; Zhao, D.; Lu, W.; Felix, J.; Hakimian, P. Detecting P2P botnets through network behavior analysis and machine learning. In Proceedings of the 2011 Ninth Annual International Conference on Privacy, Security and Trust, Montreal, QC, Canada, 19–21 July 2011; pp. 174–180. [Google Scholar]
  3. Lee, J.A.; Di Troia, F. Detecting Botnets Through Deep Learning and Network Flow Analysis. In Artificial Intelligence for Cybersecurity; Stamp, M., Aaron Visaggio, C., Mercaldo, F., Di Troia, F., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 85–105. [Google Scholar]
  4. Chowdhury, S.; Khanzadeh, M.; Akula, R.; Zhang, F.; Zhang, S.; Medal, H.; Marufuzzaman, M.; Bian, L. Botnet detection using graph-based feature clustering. J. Big Data 2017, 4, 1–23. [Google Scholar] [CrossRef]
  5. Limarunothai, R.; Munlin, M.A. Trends and Challenges of Botnet Architectures and Detection Techniques. J. Inf. Sci. Technol. 2015, 5, 51. [Google Scholar]
  6. Stamp, M. A Revealing Introduction to Hidden Markov Models. 2012. Available online: http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf (accessed on 6 May 2024).
  7. Annachhatre, C.; Austin, T.H.; Stamp, M. Hidden Markov models for malware classification. J. Comput. Virol. Hacking Tech. 2015, 11, 59–73. [Google Scholar] [CrossRef]
  8. Stamp, M. Introduction to Machine Learning with Applications in Information Security; CRC Press: Boca Raton, FL, USA, 2022. [Google Scholar]
  9. Stamp, M. Profile Hidden Markov Models and Metamorphic Virus Detection. 2012. Available online: http://www.cs.sjsu.edu/faculty/stamp/papers/profileHMM4.pdf (accessed on 6 May 2024).
  10. García, S.; Zunino, A.; Campo, M. Survey on network-based botnet detection methods. Secur. Commun. Netw. 2014, 7, 878–903. [Google Scholar] [CrossRef]
  11. Mammunni, S.R.; Sandhya, C.P. An overview of botnet and its detection techniques. Int. J. Creat. Res. Thoughts 2020, 8. Available online: https://ijcrt.org/papers/IJCRT2003313.pdf (accessed on 6 May 2024).
  12. Wu, W.; Alvarez, J.; Liu, C.; Sun, H.M. Bot detection using unsupervised machine learning. Microsyst. Technol. 2018, 24, 209–217. [Google Scholar] [CrossRef]
  13. Velasco-Mata, J.; González-Castro, V.; Fidalgo, E.; Alegre, E. Real-time botnet detection on large network bandwidths using machine learning. Sci. Rep. 2023, 13, 4282. [Google Scholar] [CrossRef] [PubMed]
  14. Ibrahim, W.N.H.; Anuar, S.; Selamat, A.; Krejcar, O.; Gonzalez Crespo, R.; Herrera-Viedma, E.; Fujita, H. Multilayer framework for botnet detection using machine learning algorithms. IEEE Access 2021, 9, 48753–48768. [Google Scholar] [CrossRef]
  15. Vishwakarma, A.R. Network Traffic Based Botnet Detection Using Machine Learning. Ph.D. Thesis, San Jose State University, San Jose, CA, USA, 2020. [Google Scholar]
  16. Ryu, S. Comparison of Machine Learning Algorithms and Their Ensembles for Botnet Detection. Ph.D. Thesis, Purdue University, West Lafayette, IN, USA, 2018. [Google Scholar]
  17. Hayawi, K.; Saha, S.; Masud, M.M.; Mathew, S.S.; Kaosar, M. Social media bot detection with deep learning methods: A systematic review. Neural Comput. Appl. 2023, 35, 8903–8918. [Google Scholar] [CrossRef]
  18. Baruah, S.; Borah, D.J.; Deka, V. Detection of Peer-to-Peer Botnet Using Machine Learning Techniques and Ensemble Learning Algorithm. Int. J. Inf. Secur. Priv. (IJISP) 2023, 17, 16. [Google Scholar] [CrossRef]
  19. Anwar, S.; Zolkipli, M.F.; Mezhuyev, V.; Inayat, Z. A Smart Framework for Mobile Botnet Detection Using Static Analysis. KSII Trans. Internet Inf. Syst. (TIIS) 2020, 14, 2591–2611. [Google Scholar]
  20. Rasheed, M.M.; Faieq, A.K.; Hashim, A.A. Android botnet detection using machine learning. Ingénierie Syst. D’Informa. 2020, 25, 127–130. [Google Scholar] [CrossRef]
  21. Alissa, K.; Alyas, T.; Zafar, K.; Abbas, Q.; Tabassum, N.; Sakib, S. Botnet Attack Detection in IoT Using Machine Learning. Comput. Intell. Neurosci. 2022, 2022, 4515642. [Google Scholar] [CrossRef] [PubMed]
  22. Raghavan, A.; Di Troia, F.; Stamp, M. Hidden Markov models with random restarts versus boosting for malware detection. J. Comput. Virol. Hacking Tech. 2019, 15, 97–107. [Google Scholar] [CrossRef]
  23. Ali, M.; Hamid, M.; Jasser, J.; Lerman, J.; Shetty, S.; Di Troia, F. Profile Hidden Markov Model Malware Detection and API Call Obfuscation. In Proceedings of the ICISSP, Online, 9–11 February 2022; pp. 688–695. [Google Scholar]
  24. García, S.; Grill, M.; Stiborek, J.; Zunino, A. An empirical comparison of botnet detection methods. Comput. Secur. 2014, 45, 100–123. [Google Scholar] [CrossRef]
Figure 1. Botnet architecture overview.
Figure 2. Botnet lifecycle phases.
Figure 3. Plot of ACF for periodic and non-periodic data.
Figure 4. Hidden Markov Model [6].
Figure 5. Profile Hidden Markov Model [9].
Figure 6. Implementation overview.
Figure 7. Data pre-processing overview.
Figure 8. Network graph design.
Figure 9. Combined autocorrelation plots.
Figure 10. Comparative confusion matrices.
Table 1. Hidden Markov Model notation [6].
Notation | Description
T | Length of the observation sequence
N | Number of hidden states in the model
M | Number of distinct observation symbols
Q | Distinct states of the Markov process, denoted as {q_0, q_1, …, q_{N−1}}
V | Possible observations, denoted as {0, 1, …, M−1}
A | N × N state transition probability matrix
B | N × M observation probability matrix
π | 1 × N initial state distribution matrix
O | Observation sequence, denoted as (O_0, O_1, …, O_{T−1})
Table 2. Profile Hidden Markov Model notation [9].
Notation | Description
X | Emitted symbols, {X_1, X_2, …, X_m}, where m ≤ N + 1
N | Number of states in the model
M | Match states, {M_1, M_2, …, M_N}
I | Insert states, {I_0, I_1, …, I_N}
D | Delete states, {D_1, D_2, …, D_N}
A | N × N state transition probability matrix
E | N × M emission probability matrix
π | 1 × N initial state distribution
a_{M_i M_{i+1}} | Transition probability from M_i to M_{i+1}
ϵ_{M_i}(k) | Emission probability of symbol k at state M_i
λ | The PHMM, λ = (A, E, π)
Table 3. CTU-13 dataset scenarios with botnet types and protocols.
Scenario No. | Botnet Type | Protocol Used
1, 2, 9 | Neris | HTTP (Hypertext Transfer Protocol)
3, 4, 10, 11 | Rbot | IRC (Internet Relay Chat)
5, 13 | Virut | HTTP (Hypertext Transfer Protocol)
6 | Donbot | IRC (Internet Relay Chat)
7 | Sogou | HTTP (Hypertext Transfer Protocol)
8 | Murlo | IRC (Internet Relay Chat)
12 | NSIS.ay | P2P (Peer-to-Peer) Protocols
Table 4. Description of features of the CTU-13 dataset.
Feature | Description
StartTime | Timestamp of data indicating when they were recorded (hh:mm:ss)
Dur | Duration of associated network flow (in seconds)
Proto | Protocol used: tcp, udp, rdp, rtp, pim, icmp, ipx/spx, arp, igmp, rarp, unas, udt, esp, ipv6, ipv6-icmp
SrcAddr | Source IP address
Sport | Port number at the source
Dir | Direction of the network flow: -> (outgoing), <- (incoming), <-> (bidirectional), <?> (unknown), who (identity lookup), <? (unknown source), ?> (unknown destination)
DstAddr | Destination IP address
Dport | Destination port number
State | Transaction state associated with the protocol
sTos | Source type of service
dTos | Destination type of service
TotPkts | Total transmitted packets
TotBytes | Total transmitted bytes
SrcBytes | Number of bytes transmitted from source to destination
Label | Traffic labels as normal, background, botnet
Table 5. Data after removing unwanted protocols and states.
Attribute | Sample 1 | Sample 2 | Sample 3
StartTime | 15-08-2011 16:43:20.935 | 15-08-2011 16:43:20.937 | 15-08-2011 16:43:20.949
Duration | 632.765 | 0.162 | 1793.589
Protocol | udp | tcp | icmp
SrcAddr | 151.51.231.119 | 147.32.84.229 | 147.32.84.174
DstAddr | 147.32.86.44 | 66.56.30.27 | 147.32.96.69
State | CON | CON | URP
TotPkts | 32,055 | 2 | 96
TotBytes | 8,577,197 | 567 | 56,640
SrcBytes | 8,575,244 | 95 | 56,640
Label | flow = Background-UDP-Established | flow = Background-TCP-Established | flow = Background
Table 6. Extracted features list after data pre-processing.
Sr. No. | Feature Name | Description | Category
1 | DurMean | Duration mean | Statistical Feature
2 | DurMedian | Duration median | Statistical Feature
3 | DurStd | Duration std. dev. | Statistical Feature
4 | DurMin | Duration minimum | Statistical Feature
5 | DurMax | Duration maximum | Statistical Feature
6 | DurRange | Duration range | Statistical Feature
7 | TotByteMean | Total bytes mean | Statistical Feature
8 | TotByteMedian | Total bytes median | Statistical Feature
9 | TotByteStd | Total bytes std. dev. | Statistical Feature
10 | TotByteMin | Total bytes minimum | Statistical Feature
11 | TotByteMax | Total bytes maximum | Statistical Feature
12 | TotByteRange | Total bytes range | Statistical Feature
13 | TotPktMean | Total packets mean | Statistical Feature
14 | TotPktMedian | Total packets median | Statistical Feature
15 | TotPktStd | Total packets std. dev. | Statistical Feature
16 | TotPktMin | Total packets minimum | Statistical Feature
17 | TotPktMax | Total packets maximum | Statistical Feature
18 | TotPktRange | Total packets range | Statistical Feature
19 | TotBytesACF | Total bytes ACF | Autocorrelation Feature
20 | TotBytesPeriodicity | Total bytes periodicity | Autocorrelation Feature
21 | TotBytesStrength | Total bytes strength | Autocorrelation Feature
22 | TotPktsACF | Total packets ACF | Autocorrelation Feature
23 | TotPktsPeriodicity | Total packets periodicity | Autocorrelation Feature
24 | TotPktsStrength | Total packets strength | Autocorrelation Feature
25 | DurACF | Duration ACF | Autocorrelation Feature
26 | DurPeriodicity | Duration periodicity | Autocorrelation Feature
27 | DurStrength | Duration strength | Autocorrelation Feature
28 | Label | ‘Botnet’ or ‘Normal’ | Label
29 | Class | Botnet type (7 in total) | Label
30 | DegreeCentrality | Degree centrality | Topology Feature
Table 7. HMM results for N = 8.
Scenario No. | Botnet Type | Training Accuracy | Testing Accuracy | Precision | Recall | F1-Score
1, 2, 9 | Neris | 0.9871 | 0.9844 | 0.7727 | 0.9883 | 0.8673
3, 4, 10, 11 | Rbot | 0.9811 | 0.9767 | 0.6727 | 0.9610 | 0.7914
5, 13 | Virut | 0.9402 | 0.9259 | 0.5823 | 0.9101 | 0.7102
6 | Donbot | 0.9989 | 0.9970 | 0.7058 | 0.9999 | 0.8275
7 | Sogou | 0.9853 | 0.9600 | 0.6646 | 0.9008 | 0.7649
8 | Murlo | 0.9623 | 0.9498 | 0.6784 | 0.9885 | 0.8046
12 | NSIS.ay | 0.9599 | 0.9480 | 0.7046 | 0.9801 | 0.8198
Normal | None (Normal Traffic) | 0.8122 | 0.7600 | 0.9538 | 0.5449 | 0.6935
Table 8. HMM results for N = 2.
Traffic | Training Accuracy | Testing Accuracy | Precision | Recall | F1-Score
Botnet | 0.8319 | 0.8223 | 0.9177 | 0.7077 | 0.7991
Normal | 0.8319 | 0.8223 | 0.7626 | 0.9366 | 0.8407
Table 9. PHMM results.
Scenario No. | Botnet Type | Training Accuracy | Testing Accuracy | Precision | Recall | F1-Score
1, 2, 9 | Neris | 0.9932 | 0.9911 | 0.8455 | 0.9776 | 0.9068
3, 4, 10, 11 | Rbot | 0.9866 | 0.9715 | 0.6865 | 0.9355 | 0.7919
5, 13 | Virut | 0.9612 | 0.9580 | 0.7406 | 0.9916 | 0.8480
6 | Donbot | 0.9849 | 0.9739 | 0.8835 | 0.9125 | 0.8977
7 | Sogou | 0.9946 | 0.9964 | 0.9846 | 0.9901 | 0.9873
8 | Murlo | 0.9813 | 0.9933 | 0.8958 | 0.9247 | 0.9100
12 | NSIS.ay | 0.9970 | 0.9929 | 0.9074 | 0.9245 | 0.9158
Normal | None (Normal Traffic) | 0.9047 | 0.8915 | 0.9535 | 0.7906 | 0.8644
Table 10. Comparison of average accuracies.
Model | Average Training Accuracy | Average Test Accuracy
HMM | 0.9533 | 0.9377
PHMM | 0.9754 | 0.9710
Table 11. Comparison of average F1 scores.
Model | Average F1 Score
HMM | 0.7849
PHMM | 0.8902
Table 12. Comparison with other models.
Model | F1-Score | Accuracy
Decision Trees | 0.79 | 0.92
SVM | 0.14 | 0.68
kNN | 0.86 | 0.94
Naïve Bayes | 0.53 | 0.78
MLP | 0.99 | 0.99
HMM | 0.78 | 0.94
PHMM | 0.89 | 0.97
