1. Introduction
Since the beginning of this century, cyberattacks have threatened users and organizations. They have become more complex and sophisticated with the development of computer networks. Nowadays, multistep attacks mostly aim at stealing important data from organizations and leads to serious data security problems.
A multistep attack, also called a multistage attack, intrusion scenario, refers to the ensemble of steps taken by one or several attackers with a single specific objective inside the network, containing at least two distinct actions [
1]. The multistep attack consists of multiple single-step attacks which can be detected by the IDS (intrusion detection system), however, the multistep attack is not easy to discover, having some difficulties:
- (1)
Although the IDS is efficient in detecting the single-step attack, it can hardly reveal the logical relations among these single-step attacks.
- (2)
The real attack is a small probability event, and the most dangerous attacks happen only rarely, which mean that in each dataset we have only a few examples of attacks. This problem in intrusion detection research, called the “rare data problem” [
2], seems even worse when multistep attacks are considered.
- (3)
IDS sensors could miss some steps of a multistep attack because these steps may appear normal to them [
3]. Thus, we cannot assume that the important alerts related to the real attack will always be there.
- (4)
High data redundancy and volumes will slow down the analysis process, which is a significant challenge for the development of systems working in real time.
In this paper, an intrusion scenario discovery framework is proposed to tackle these difficulties by data preprocessing and online correlation analysis. The proposed framework was designed based on the following considerations:
First, the framework is a totally online system and has no offline mining or learning part; it is different from other multistep attack detection systems [
4,
5] that need offline processing. The HTM algorithm can learn the association patterns of intrusion actions in real time and the learned patterns will evolve with the incoming data over time, which is a good characteristic for online learning because it provides a continuous learning mechanism for our framework. However, the “short-term memory” has its negative influence on the prediction accuracy, so, an auxiliary method is proposed to mitigate the influence.
Second, the framework is designed based on the processing of streaming data and does not limit the boundary or the quantity of data, which means the model should be constructed cumulatively.
Third, the framework needs to predict the intrusion at an early stage without waiting for the model to be completely built. Given any intrusion action, the proposed framework will output the predicted action sequence that may happen with a higher probability in the future.
This is the first time to our best knowledge that the HTM algorithm is combined with intrusion scenario discovery. HTM has some attractive features, such as strong noise resistance, continuous learning without retraining, and an unsupervised prediction-driven learning mechanism. The proposed framework can benefit from these features.
In this section, four categories of existing approaches for constructing the intrusion scenarios are discussed and, after that, some HTM use cases are introduced.
- (1)
Some early prediction models based on prior knowledge are created manually by security experts. The domain knowledge, such as known cyber threats or intrusion scenarios, are leveraged in the modeling process. These models can highly represent real intrusion scenarios, and provide a high accuracy of prediction with no ambiguity, which is the reason why they are popular in the industry. However, it is heavy work for the experts to keep the models always available for attack strategies that are changing all the time. Early studies used the approach proposed by Qin and Lee [
6], which can identify attack plans based on isolated attack scenarios, but appreciable adjustments should be made to predict the attacks that are beyond the predefined attack plans.
- (2)
Two types of graph models, HMM (hidden Markov models) and Bayesian networks, are commonly used in intrusion predictions. Essentially, they are attack graphs that can be constructed automatically. Statistical methods are utilized to obtain the transition probabilities or probability distributions based on which the prediction can be made. Ramaki et al. [
7] proposed an attack prediction model based on the previous work [
4], which is a real-time IDS alert correlation framework. In their model, Bayes networks and a Bayes attack graph (BAG) are employed to increase the accuracy of the prediction. Unlike the models based on Bayes networks, the HMM-based methods are insensitive to the missed states and transitions, which means it can work well without complete information of the intrusion scenarios [
8]. Farhadi et al. [
9] proposed the alert correlation and prediction framework utilizing the HMM model as the prediction component, which does not require the information about the network topology, system vulnerabilities, and system configurations, and is robust against over-fitting. However, the model cannot predict patterns of unknown attacks. Additionally, both of the models are sensitive to outliers and the lack of scalability for the various intrusion scenarios.
- (3)
Data mining is another prevalent technology for discovering correlations within the massive amount of security data before constructing the intrusion prediction models. Husák et al. [
10] proposed an attack prediction framework based on data mining to identify the attack patterns from a million alerts. Sequential rule mining is employed to derive rules for prediction. Kim et al. [
11] proposed an approach to create an attack graph based on data mining techniques for intrusion prediction. Although data mining can discover unknown intrusion patterns, redundant, small, frequent items would affect the accuracy of the prediction. The most important is that the data mining models cannot discover the critical data items with lower frequency.
- (4)
Since the ANN (artificial neural network)-related approaches are proposed to meet the intrusion prediction, many studies show that neural networks have a better result in predicting attacks than others. However, neural networks need to be trained offline and retrained in a regular time interval to meet the constantly changing data. Additionally, feature selection and dimensional reduction are two essential and complex tasks to conduct, and the parameters should be learned from the data. In other words, neural networks are heavy models and not optimal for online prediction in real-time. Subba et al. [
12] proposed an intrusion detection model which can reduce the computational resources in the training phase and is suitable for real-time deployment. As described above, it needs feature selection and daunting parameter tuning to gain better performance.
The HTM (Hierarchical Temporal Memory) algorithm is a new type of intelligent learning system inspired by the structures and mechanism of data processing of the mammalian neocortex. Although the HTM is still evolving with the latest discoveries in neurobiology, many applications of it are proposed for its outstanding characteristics. The existing use cases based on HTM theory are discussed in detail.
Cui et al. [
13] analyzed the properties of HTM sequence memory in the sequence learning and prediction tasks and compared the HTM sequence memory with other sequence learning algorithms. The model has better performance in continuous online learning, robustness to noise, fault tolerance, and no parameter tunings.
El-Ganainy et al. [
14] proposed a method to predict medical streams in real-time; the paper employs two machine learning algorithms: HTM (Hierarchical Temporal Memory) and LSTM (Long Short-Term Memory) to model the same medical data stream, and the experimental results show that the HTM algorithm is more effective than the LSTM algorithm in three aspects: no training process, no domain knowledge, and the ability to adapt to variations in the data.
Li et al. [
15] proposed a framework to detect the anomaly in the stream of air traffic data broadcasted between a flight and the air traffic control center. The HTM is used to encode the data and learns the patterns from the sequential data for the anomaly prediction, and then conducts the deviation analysis to identify the attack based on the former predictions.
Wang et al. [
16] proposed a distributed vehicle network anomaly detection framework with the abilities of continuous online learning and exception scoring. The framework can learn the network data sequence and detect the abnormal states of the vehicle networks.
Shah et al. [
17] proposed a framework to categorize documents using the SP (Spatial Pooling) algorithm of the HTM theory, and LSI (Latent Semantic Indexing) technology is used for extracting the top features from the input and converting them into binary format. The assessment shows that the HTM model is accurate compared to the conventional technologies used in text classification.
Shen et al. [
18] have proposed a gait pattern recognition approach using the extended HTM algorithm. In the paper, the authors decomposed the gait image into sequences of movements of each body parts, and learns the features through three levels of neurons. A Markov chain is used to discover the feature groups in which the mutual transition probabilities are above the threshold, then the HTM algorithm, which is extended with a dynamic programming algorithm, is used to infer the complex sequences. The HTM algorithm proves the effectiveness of visual problems.
The main contributions of this paper are: (1) to introduce a series of data preprocessing methods to refine the streaming data; (2) the HTM algorithm is introduced to enhance the ability of online learning; and (3) an auxiliary method is proposed to improve the accuracy of intrusion scenario discovery.
4. Discussion
The discussions of the experimental results are given in
Section 3.2. In this section, our works are concluded. The unsupervised online sequence learning and prediction system is proposed. Several measures have been introduced to enhance the performance of the system, such as intrusion action clustering, intrusion session reconstruction, and the optimizations of data processing. These measures can effectively reduce the data noises, redundancies, and false positives, making it easier to discover the potential intrusion behaviors and the correlations between them. The reduction of intrusion action types and unified intrusion sessions have greatly reduced the ambiguous correlations and further simplified the correlation graphs. We introduced the HTM cortical learning algorithms to improve the online sequence learning, based on the predictions of HTM, the correlation graph can be constructed in real-time and keeps updating over time. Offline training is not needed anymore, which makes it lightweight for online work.
Many aspects of the system should be studied and enhanced in the future: (1) the orders of actions in a session are important for the system; if all actions randomly change their relative positions frequently, the association patterns between actions are difficult to discover for the system, thus, a pattern stabilization method (maybe based on the statistics) which can correct the disordered actions before learning is appropriate for the system; (2) if multiple intrusion scenarios occurred in parallel and share some common intrusion actions, the framework will correlate them into one large scenario, which leads to the complex model, so the studies of parallel intrusion scenario construction is a direction; and (3) although the HTM algorithm can learn the unknown patterns of the sequence, it can only predict the next time step, rather than predicting a sequence of future actions. The parameters of the HTM algorithm should be carefully set, though it is easier to set than other machine learning algorithms.