Online Mining Intrusion Patterns from IDS Alerts

Featured Application: In this paper, an influence model is proposed to tackle sequence data analysis problems such as disordering, missing elements, and random noise. The proposed method can be used for mining intrusion patterns from the intrusion action sequences extracted from IDS (Intrusion Detection System) alerts.
Abstract: The intrusion detection system (IDS), which is widely used in enterprises, produces a large number of logs called alerts, from which intrusion patterns can be mined. These patterns can be used to construct intrusion scenarios, to discover the final objectives of malicious actors, and even to assist the forensic investigation of network crimes. In this paper, a novel algorithm for intrusion pattern mining is proposed, which aims to solve difficult problems of the intrusion action sequence such as the loss of important intrusion actions, the disorder of the action sequence, and random noise actions. These problems often occur in real production environments and seriously degrade the performance of the analyzing system. The proposed algorithm is based on the online analysis of the intrusion action sequences extracted from IDS alerts: by calculating the influence of a particular action on the subsequent actions, the real intrusion patterns are discovered. The experimental results show that the method is effective in discovering patterns from complex intrusion action sequences.


Introduction
According to the kill chain proposed by Bryant et al. [1], the inside multi-step attack has become the main part of the intrusion process, which includes the pre-hack (reconnaissance and delivery), hack (installation and privilege escalation), compromise (lateral movement and actions on objective), and theft (exfiltration) stages. Currently, many sophisticated intrusion attacks [2][3][4] start from inside the victim network through a spear-phishing email or a vulnerable host. Once the victim host executes the malicious code, the attacker begins to explore the whole network to discover the resources needed for further attacks. The attacker will hide their traces and avoid detection during the intrusion [5,6]. In short, it is critical to enhance the existing multi-step attack detection abilities inside the network.
However, multi-step attack detection approaches are mostly based on the intrusion detection system (IDS) sensors deployed on hosts or at the entrance of the network, and they face severe challenges: (1) high false positives, which lead to the insertion of irrelevant intrusion actions into the data stream; (2) high redundancy, which is caused by the IDS detection mechanisms or the intended ...
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 15
In the intrusion action extraction phase, alerts are grouped by two fields: the source IP address and the destination IP address. The intrusion actions are then extracted based on the type field and the destination port field of the alerts in the same group. In this phase, most redundant and repeated alerts are merged.
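The grouping and merging described for the action extraction phase can be illustrated with a minimal sketch. The alert field names (`src`, `dst`, `type`, `dport`, `time`) and the rule that only consecutive identical actions are merged are assumptions made for illustration; they are not taken from the system described in this paper.

```python
from collections import defaultdict


def extract_actions(alerts):
    """Group alerts by (src, dst) and collapse each group into timed actions.

    Each alert is a dict with hypothetical keys 'src', 'dst', 'type',
    'dport' and 'time'. An action is identified here by (type, dport).
    """
    groups = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        groups[(alert["src"], alert["dst"])].append(alert)

    sequences = {}
    for pair, group in groups.items():
        actions = []
        for alert in group:
            action = (alert["type"], alert["dport"])
            # Assumed merging rule: consecutive alerts that map to the
            # same action are treated as redundant and kept only once.
            if not actions or actions[-1][0] != action:
                actions.append((action, alert["time"]))
        sequences[pair] = actions
    return sequences
```

Each value of the returned dictionary is a time-ordered list of `(action, time)` pairs for one host pair, which is the form assumed by later sketches of session splitting.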
In the intrusion session pruning phase, a long action sequence is divided into several short sequences (intrusion sessions) by calculating the average time interval between actions. A pruning process then removes the repeated sub-patterns from the original sequence. The pruning algorithm can significantly reduce the length of the intrusion sessions without destroying the original associations between actions.
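A possible reading of the session pruning phase is sketched below. Splitting wherever a gap exceeds the average gap, and collapsing only immediately repeated sub-patterns, are both assumptions; the pruning algorithm of the previous work may differ in its details.

```python
def split_sessions(timed_actions):
    """Split (action, time) pairs wherever a gap exceeds the average gap."""
    if len(timed_actions) < 2:
        return [[a for a, _ in timed_actions]] if timed_actions else []
    gaps = [t2 - t1 for (_, t1), (_, t2) in zip(timed_actions, timed_actions[1:])]
    avg_gap = sum(gaps) / len(gaps)

    sessions, current = [], [timed_actions[0][0]]
    for (action, _), gap in zip(timed_actions[1:], gaps):
        if gap > avg_gap:
            # A longer-than-average pause starts a new session.
            sessions.append(current)
            current = []
        current.append(action)
    sessions.append(current)
    return sessions


def prune_repeats(session):
    """Collapse immediately repeated sub-patterns, e.g. A B C B C D -> A B C D."""
    pruned = list(session)
    changed = True
    while changed:
        changed = False
        for size in range(1, len(pruned) // 2 + 1):
            i = 0
            while i + 2 * size <= len(pruned):
                if pruned[i:i + size] == pruned[i + size:i + 2 * size]:
                    # Drop the second copy and re-check the same position.
                    del pruned[i + size:i + 2 * size]
                    changed = True
                else:
                    i += 1
    return pruned
```

The order of the remaining actions is preserved, so the associations between neighbouring actions are kept while the session becomes shorter.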
In the correlation discovery phase, all the pruned sessions are fed to the correlation discovery module in order of session start time. The BIF algorithm is applied to calculate the attraction level between any two actions of a session. The influence factor (IF) values, which express the attraction level, increase or decrease as data arrive over time, and the real association relations are built from them.
The dynamic correlation graph (DCG) is constructed from the discovered correlations with higher IF values. The DCG links and nodes are dynamically created or destroyed according to the influence factor matrix (IFM), which maintains all IF values and updates them in real time.
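The relation between the IFM and the DCG can be sketched as a simple thresholded rebuild. The nested-dictionary layout of the matrix, the function name `build_dcg`, and the use of a single fixed threshold are illustrative assumptions rather than the data structures of the actual system.

```python
def build_dcg(ifm, threshold):
    """Return an adjacency dict with only the links whose IF meets the threshold.

    `ifm` is a hypothetical nested dict: ifm[source][target] = IF value.
    """
    graph = {}
    for source, row in ifm.items():
        for target, value in row.items():
            if value >= threshold:
                graph.setdefault(source, {})[target] = value
                # Make sure the target also exists as a node.
                graph.setdefault(target, {})
    return graph
```

Rebuilding the graph after each matrix update makes links and nodes appear or disappear as their IF values cross the threshold, which mirrors the dynamic behaviour described above.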
In this paper, the proposed BIF algorithm is introduced in detail. It can be leveraged either as an optimized method for the intrusion scenario discovery task of our former work [15] or as a separate intrusion pattern mining system.
The proposed algorithm is based on the assumption that the distance between two actions in a session can be used to measure their association strength.

Materials and Methods
The network environment is always complex and unpredictable, and the problems it causes can badly impact network security systems. In addition, the IDS itself can cause problems such as redundant alerts and repeated patterns.
As shown in Figure 1, S1 is the correct intrusion session extracted in the experimental environment, while S2, S3, and S4 are three different states in a running environment. The action sequence is altered by the different configurations of the network and the security systems. Therefore, it is necessary to pay attention to these problems and to minimize their impact.
Figure 1. Illustration of real intrusion sessions. S1 to S4 are intrusion sessions; A-H are the intrusion actions that compose an intrusion scenario; S1 denotes the original session state; S2 is disordered and contains some irrelevant actions; B is lost in S3; B-C in S4 is repeated due to the intrusion detection system (IDS).
For some IDS-introduced problems, a few methods were proposed in our previous work [16,17]. For example, the redundant alerts can be reduced in the action extraction phase, and some repeated action patterns can be removed in the session pruning phase by the pruning algorithm. In this paper, we mainly focus on learning from incomplete and disordered sessions (sequences) and on discovering attack patterns in real time.

Definitions
The problem domain can be formalized as follows. An intrusion action $a_i \in A$ consists of a group of alerts, where $A$ denotes the set of all unique actions and $|A|$ denotes the size of $A$. An intrusion session $s^{(xy)} = \langle a^{(xy)}_{i_1}, a^{(xy)}_{i_2}, \ldots, a^{(xy)}_{i_n} \rangle$ is a sequence of actions ordered by their time field, where $x$ and $y$ denote the two hosts from which the session is extracted.
It is assumed that an intrusion action has an attraction effect on the subsequent actions, and that a particular action happens due to the attraction of one or more other actions which happened in different sessions. The degree of influence can be calculated and used to measure the association strength between actions.
For a given session $s^{(xy)}$, the action $a^{(xy)}_{i_1}$ has a direct influence on $a^{(xy)}_{i_2}$ and an indirect influence on $a^{(xy)}_{i_3}$; the degree of influence decreases as the distance between $a^{(xy)}_{i_1}$ and $a^{(xy)}_{i_n}$ grows. The influence range is $1 \le \omega \le w$, where $w$ is the number of actions influenced. For a given $\omega$, if an intrusion action B falls in the influence range of A, then A influences B (A attracts B), which is denoted as $A \to B$. The algorithm calculates the influences on the actions that fall in the influence range of a specified action in the session. The influence range is shown in Figure 2.
The influence degree of $a_i$ on $a_j$ can be calculated using Equation (1), where $a_i$ and $a_j$ denote the $i$th and $j$th actions of the session, respectively, $\omega$ is the influence range, and $|s|$ denotes the length of the session. The more actions exist between two actions, the lower their influence and the lower their association strength. The influence of each pair of actions updates the old value recorded in the influence matrix shown in Figure 3. In the matrix, the values in a row express the attraction strengths of a particular action on other actions, and the values in a column express the attraction strengths of other actions on that particular action.
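The idea of an influence range can be made concrete with a short sketch that enumerates every ordered pair of actions lying within $\omega$ positions of each other. The decay function used here is an assumed linear form chosen only for illustration; it should not be read as Equation (1), and the function name `session_influences` is hypothetical.

```python
def session_influences(session, omega):
    """Return {(a_i, a_j): influence} for every pair within the influence range.

    The influence is an ASSUMED linear decay: 1.0 for the direct successor,
    decreasing with each extra step, and undefined beyond `omega` steps.
    """
    pairs = {}
    for i, source in enumerate(session):
        for distance in range(1, omega + 1):
            j = i + distance
            if j >= len(session):
                break
            value = (omega - distance + 1) / omega
            key = (source, session[j])
            # If the same pair occurs twice in a session, keep the stronger value.
            pairs[key] = max(pairs.get(key, 0.0), value)
    return pairs
```

For the session A, B, C, D with $\omega = 2$, this sketch gives A a full influence on B, a weaker influence on C, and no recorded influence on D. Each key `(a_i, a_j)` corresponds to one cell of the influence matrix, with `a_i` selecting the row and `a_j` the column.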
With the continuous arrival of intrusion sessions, two actions A and B may have different influence values in different sessions, so the old influence in the matrix should be updated with the new value using Equation (2), where $\alpha$ indicates the refreshing rate, $F_o$ denotes the original influence value recorded in the matrix, and $F_n$ denotes the newly calculated influence value. Note that the updated $F_o$ may increase or decrease.
The number of sessions that contain $A \to B$ is denoted as $f(A \to B)$, and the total number of sessions in an analyzing period is denoted as $|S|$. The probability of $A \to B$ can be calculated using Equation (3):
$$P(A \to B) = \frac{f(A \to B)}{|S|} \quad (3)$$
The comprehensive influence of intrusion action A on B can be calculated using Equation (4). With Equation (4), the influence strength of A on B is calculated based on historical observation.
For a given intrusion action A, what predictions will the algorithm make? First, the actions influenced by A are collected as candidate predictions. Second, the comprehensive influences are calculated, and the strongly influenced actions are selected as the predictions.
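The update, probability, and prediction steps can be tied together in a small sketch. Two parts of it are assumptions and not the formulas of this paper: the update is written as exponential smoothing with refreshing rate `alpha`, and the comprehensive influence is taken as the product of the stored influence and the probability of $A \to B$. The class name `InfluenceModel` and its methods are hypothetical.

```python
class InfluenceModel:
    """Toy online model: running influence values plus pair frequencies."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha        # refreshing rate
        self.matrix = {}          # (A, B) -> stored influence F_o
        self.pair_counts = {}     # (A, B) -> f(A -> B)
        self.sessions = 0         # |S|

    def learn(self, pair_influences):
        """Consume {(A, B): F_n} computed from one session."""
        self.sessions += 1
        for pair, f_new in pair_influences.items():
            f_old = self.matrix.get(pair, 0.0)
            # ASSUMED update rule: exponential smoothing of old and new values.
            self.matrix[pair] = (1 - self.alpha) * f_old + self.alpha * f_new
            self.pair_counts[pair] = self.pair_counts.get(pair, 0) + 1

    def probability(self, a, b):
        """Share of observed sessions that contain A -> B."""
        if self.sessions == 0:
            return 0.0
        return self.pair_counts.get((a, b), 0) / self.sessions

    def predict(self, a, top_n=3):
        """Rank the actions influenced by `a` by an ASSUMED combined score."""
        scored = [
            (b, self.matrix[(x, b)] * self.probability(x, b))
            for (x, b) in self.matrix
            if x == a
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
```

Because the stored value is blended with each new observation, it rises when a session shows a stronger influence and falls when a session shows a weaker one, which matches the remark that the value may increase or decrease.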

Results
In this section, the evaluation results of the proposed algorithm are discussed. Four types of datasets are generated automatically and the algorithm is tested based on them. The experimental results are discussed separately.

Evaluations On The Automatically Generated Dataset
We extracted intrusion sessions from the CICIDS2017 intrusion detection evaluation dataset [18] in our previous work [15], but the extracted sessions are not complicated and varied enough to evaluate the proposed method. In addition, we mainly focus on the sequence analysis performance and the intrusion pattern discovery accuracy rather than on data preprocessing, although effective data preprocessing can increase the capacity of the proposed method.
According to the characteristics of intrusion patterns extracted from real data in our former work, we generated multiple datasets, each having a different type of defect described in Section 2.

Baseline Dataset
First, we generate a standard baseline dataset for comparison with the other datasets. The baseline dataset contains 10 different intrusion patterns (action sequences) involving 90 unique actions and 500 random background action sequences. Each intrusion pattern shares no common actions with the others and contains no irrelevant actions; in short, the patterns have no data problems. The details of the generated dataset are listed in Table 1. The pattern sessions used to generate the baseline dataset are listed in Table 2.
Table 2. Generated pattern sessions.
We first tested the method with the baseline dataset. After the dataset was learned, different actions were fed to the algorithm to check the prediction results by calculating the top average influences. The results show that the algorithm can predict the future action sequence accurately because the session patterns are clean. The test results are listed in Table 3.
Table 3. Prediction results based on the baseline dataset. Columns: Inputs; Top Predicted Actions (Average Influence Factor).
Based on the data shown in Table 3, the algorithm works well with the baseline dataset. The top average influence is used to determine the most probable next action. The baseline dataset is only used to check whether the approach is feasible.
In the following evaluations, three datasets, each with a different data defect, are generated and used to test the proposed algorithm. In the end, a dataset with all three types of data defects is generated to simulate the real data environment, and the BIF algorithm is fully tested.

Noisy Dataset
A noisy dataset is generated by inserting random irrelevant actions into each pattern session of the baseline dataset; the number of noise actions inserted into each pattern session can be specified. We inserted 1-10 noise actions (10 noise levels) at random positions of each pattern session to observe the prediction errors relative to the baseline dataset. There were 10 tests for each noise level, and for each test the dataset was regenerated to ensure sufficient randomness. The system made 10 predictions based on different inputs, and the prediction errors were counted. The average influences of each candidate action were calculated, and the top n actions with the highest average influences were selected as the prediction results, which were compared with the predictions shown in Table 3.
A part of the generated noisy dataset is shown in Figure 4, in which each row contains an intrusion session. The actions that start with "b" are the noise actions, and all the actions are separated by commas.
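The noise insertion step can be sketched as follows. The function name `add_noise`, the noise pool, and the optional `rng` argument are hypothetical; the tests in this paper were implemented in Java, and this Python fragment only illustrates the idea.

```python
import random


def add_noise(pattern, noise_level, noise_pool, rng=random):
    """Return a copy of `pattern` with `noise_level` noise actions inserted.

    Each noise action is drawn from `noise_pool` and placed at a random
    position, so the relative order of the pattern actions is preserved.
    """
    noisy = list(pattern)
    for _ in range(noise_level):
        position = rng.randint(0, len(noisy))  # randint includes both ends
        noisy.insert(position, rng.choice(noise_pool))
    return noisy
```

Passing a seeded `random.Random` instance as `rng` makes a generated dataset reproducible, while the default module-level generator gives a fresh dataset for every test.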
The accuracy of the analysis based on the noisy dataset is shown in Table 4, and the evaluation results are shown in Figure 5.
As shown in Figure 5, the average prediction accuracy decreases as the noise level increases. When 10 noise actions are inserted into each pattern session, the prediction accuracy stays above 93.7%. The average accuracy of predictions based on the input actions is 96.3%. In short, the proposed algorithm can learn the patterns from noisy sequences with high accuracy.

Disordered Dataset
In this section, we test the algorithm using the disordered dataset generated from the baseline data. In this paper, the term disordered means that actions in an intrusion session (action sequence) have changed their positions for various reasons; such a session is called a disordered session. An action that has changed its original position is called a disordered action. The disorder level of a dataset is the average number of disordered actions in each pattern session.
There are five disorder levels, with 10 tests for each level. For each test, two datasets are generated separately: the training dataset and the testing dataset. The disordered actions in each pattern session are randomly selected and moved to random positions in the session. The pattern sessions, each with 10 actions, are shown in Table 5. The "appear times" value is the number of appearances of each disordered pattern session in the dataset. For each test, an input action is fed to the system for prediction. The testing results are shown in Table 6.
There are more prediction errors than in the noisy data evaluation. The average prediction accuracy for each pattern session is 89.7%. The relation between the prediction accuracy and the disorder level is illustrated in Figure 6.
The proposed algorithm is more sensitive to disordered data than to noisy data because it learns patterns mainly by accumulating the influence between ordered actions; if the order of actions changes frequently, it is hard for the algorithm to learn any pattern. With 50% disordered actions in each session, the prediction accuracy is 85%, which is an acceptable result.
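The generation of a disordered session can be sketched as below. The function name `disorder` is hypothetical, the sketch assumes that every action in the pattern is unique, and a moved action may occasionally land back on its original position, so the effective disorder level can be slightly lower than requested.

```python
import random


def disorder(pattern, level, rng=random):
    """Move `level` randomly chosen actions to random positions in a copy.

    Assumes the actions in `pattern` are unique, as in the generated
    pattern sessions.
    """
    session = list(pattern)
    for action in rng.sample(pattern, level):
        session.remove(action)
        session.insert(rng.randint(0, len(session)), action)
    return session
```

The output contains exactly the same actions as the input, only in a different order, which separates this defect from the noisy and incomplete cases.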

Incomplete Dataset
The incomplete dataset refers to the situation in which actions in an intrusion session are lost for various reasons. A missing action does not change the order of the other actions in the session; however, it causes serious problems in data mining systems and statistical systems, especially sequential learning systems.
The proposed method can solve the missing data problem by creating association relations between actions. The influence relation between two actions becomes stronger if they always appear in the same session and are close to each other; otherwise, the influence relation weakens. Therefore, a missing action will not impact the learned patterns.
The evaluation test is similar to the disordered data evaluation. The six types of pattern sessions are the same as the ones in Table 5. There are five missing levels for each intrusion session; for instance, level 2 means that two random actions of the pattern session are lost. For each missing level, the system makes 10 predictions for the particular input. The prediction results are shown in Table 7.
The average accuracy of the predictions made for a particular input is 96%, which means the algorithm has a strong resistance to incomplete data. The relation between the missing level and the prediction accuracy is shown in Figure 7.
The algorithm first analyzes the training dataset and then loads the testing dataset, which is generated with the same parameters as the training dataset. All the datasets contain randomly generated pattern sessions and background sessions. The experimental results show that missing actions hardly impact the proposed algorithm. The average prediction accuracy stays above 93% even when 40% of the actions are missing.

The Heterogeneous Dataset
Combining the three types of data defects described above, we generated the final heterogeneous dataset containing 50% noise actions, 20% disordered actions, and 20% missing actions. There is only one type of intrusion session that expresses the unknown intrusion pattern in the dataset. The BIF algorithm must analyze the overall data to discover the unknown intrusion pattern. The actions of the unknown intrusion pattern also appear randomly in other sessions. Finally, the discovered pattern is compared with the initial session.
In this experiment, the dataset is described first, and then the experimental results are discussed. We want to know how badly degraded the data can be while the algorithm still discovers the whole unknown pattern, and how well it performs at finding effective patterns.
First, the composition of the generated dataset is listed in Table 8. As shown in Table 8, the first type of data is the pattern sessions, which carry the target pattern $\langle a_1, a_2, \ldots, a_{10} \rangle$; the actions related to the target pattern are named with the prefix "a" to distinguish them. When the pattern sessions are generated, they are obfuscated with the following steps:
• Inserting noise actions. The noise actions are named with the prefix "b", for example, $b_1$, $b_2$.
• Removing actions. The pattern-related actions may be removed from the session randomly to simulate incomplete data.
There are 50 types of sessions in the irrelevant sessions shown in Table 8. They are ordinary patterns used to mislead the algorithm: they contain a few actions that appear in the target pattern sessions, at a specified percentage, to simulate the real data environment. Intrusion patterns always share common actions with other sessions. The actions of the irrelevant sessions are named with the prefix "c". These irrelevant intrusion patterns are also obfuscated through the steps described above. The generated data are shown in Figure 9.
The third type of data is generated randomly to increase the total data volume and to introduce the data sparsity problem. In addition, these sessions have a specified probability of containing actions that appear in the first type of data. The generated data of this type are shown in Figure 10.
The three types of data are generated separately; however, they are randomly mixed into the data stream fed to the analyzing algorithm. About 2350 sessions are generated, and more than 20,000 actions are created in the dataset. It takes about 600 milliseconds to finish the whole process, including data generation, analysis, file writing, and result calculation. All the tests in this paper were run on a server with 8 GB RAM and a 2.6 GHz Intel CPU running the Windows 10 operating system, and all the tests were implemented in the Java programming language.
data-generating, analyzing, file writing and results calculating. All the tests in this paper were run on a server with 8 GB of RAM and a 2.6 GHz Intel CPU running the Windows 10 operating system, and all the tests were implemented in the Java programming language.
Based on the dataset with 20% disordered actions, 50% noise actions and 20% missing actions, in which the pattern sessions share 20% of their actions with the ordinary pattern sessions, the target intrusion pattern discovery result is illustrated in Figure 11.
As shown in Figure 12, each orange circle denotes an action, and the attraction list shows the top five actions that are most heavily influenced by the specified action. The first (bold) action in the list is the next action that the specified action points to, and the floating-point numbers in the list are the comprehensive influence factors. Two orange arrows in Figure 12 point at the wrong actions compared with the initial pattern, which means the proposed method needs further adjustment and optimization. The influence threshold β = 0.01 is used to filter out the actions with influences below β; thus, there are no subsequent actions after a10. The correlation graph can be updated over time, and the links between actions will change accordingly.
The discovery accuracy is used to measure the ability of the system to discover patterns. It is calculated as δ = |D|/|E| × 100%, where |D| denotes the number of correct relations discovered by the system, and |E| denotes the number of relations the system should correctly discover.
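As a minimal sketch of the discovery accuracy defined above, the following Java snippet computes δ = |D|/|E| × 100% from two sets of relations. The class name DiscoveryAccuracy, the "from->to" string encoding of a relation, and the reading of |D| as the discovered relations that also appear in the expected set are assumptions made for illustration; they are not taken from the implementation used in the tests.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative computation of the discovery accuracy: delta = |D| / |E| * 100%. */
final class DiscoveryAccuracy {

    /**
     * @param discovered relations reported by the system, encoded as "from->to" (hypothetical encoding)
     * @param expected   relations of the ground-truth pattern (the set E)
     * @return percentage of expected relations that were correctly discovered
     */
    static double accuracy(Set<String> discovered, Set<String> expected) {
        if (expected.isEmpty()) {
            throw new IllegalArgumentException("expected relation set must not be empty");
        }
        // D: discovered relations that also belong to the ground truth (assumed reading of "correct").
        Set<String> correct = new HashSet<>(discovered);
        correct.retainAll(expected);
        return 100.0 * correct.size() / expected.size();
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(
                Arrays.asList("a1->a2", "a2->a3", "a3->a4", "a4->a5"));
        Set<String> discovered = new HashSet<>(
                Arrays.asList("a1->a2", "a2->a3", "a3->a5", "a4->a5"));
        // Three of the four expected relations were found, so this prints 75.0.
        System.out.println(accuracy(discovered, expected));
    }
}
```

Under this reading, a wrongly discovered relation such as "a3->a5" does not lower the score by itself; it only matters because the correct relation "a3->a4" is missing.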
Quantitative tests are used to determine which factors affect the accuracy of the proposed method. For the three types of data listed in Table 8, the relationship between the proportion of each type in the dataset and the pattern discovery accuracy is tested. The results show that:

• The number of background sessions has little effect on the discovery accuracy;
• The irrelevant pattern sessions affect the accuracy depending on the number of actions they share with the target pattern sessions;
• With a fixed percentage of noise data, disordered data and incomplete data, the discovery accuracy is mainly affected by the appearance frequency of the pattern sessions.
The experimental results are shown in Figure 12. If an intrusion pattern appears more than 50 times in a dataset containing the three types of data defects, the discovery accuracy reaches 91% and remains stable.
If the percentage of actions shared between the target pattern sessions and the irrelevant pattern sessions is lower than 40%, the pattern identification accuracy stays above 90%. This makes the method effective at distinguishing a particular intrusion pattern from other, similar intrusion patterns.

Discussion
In this paper, a novel sequence learning and mining algorithm is proposed to meet the challenges posed by defects in network logs (or IDS logs) that are caused by various environmental problems. The algorithm is simple, lightweight and effective in discovering the patterns hidden in the network logs.
The proposed algorithm is based on the backward attraction calculation, which means that the association between two intrusion actions is measured by the distance between their indices in the intrusion session (action sequence). The closer their positions in the session, the stronger the attraction between them. The attraction strength is updated and accumulated as the two actions appear at different distances in different sessions. The actions with higher attractions are selected to construct the correlation graph, which is used for attack prediction or attack scenario recognition.
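The backward attraction idea can be sketched in Java as follows. This is an illustration under stated assumptions, not the exact formulation of the proposed algorithm: the 1/distance contribution, the normalization of each action's attractions by their sum, and the names AttractionSketch, observe and next are all hypothetical. Only the threshold value β = 0.01 is taken from the experiments.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical sketch: accumulate distance-based attraction between actions, session by session. */
final class AttractionSketch {
    // attraction.get(a).get(b): accumulated attraction of action a on a later action b.
    private final Map<String, Map<String, Double>> attraction = new HashMap<>();

    /** Online update with one session (an ordered list of action identifiers). */
    void observe(List<String> session) {
        for (int j = 1; j < session.size(); j++) {
            String later = session.get(j);
            for (int i = 0; i < j; i++) {          // look backward from position j
                String earlier = session.get(i);
                if (earlier.equals(later)) {
                    continue;                      // ignore self-relations
                }
                // Assumed decay: closer positions contribute more attraction.
                double contribution = 1.0 / (j - i);
                attraction.computeIfAbsent(earlier, k -> new TreeMap<>())
                          .merge(later, contribution, Double::sum);
            }
        }
    }

    /** Returns the most attracted next action whose normalized influence is at least beta. */
    Optional<String> next(String action, double beta) {
        Map<String, Double> row = attraction.get(action);
        if (row == null) {
            return Optional.empty();
        }
        double total = 0.0;
        for (double v : row.values()) {
            total += v;
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> e : row.entrySet()) {
            double score = e.getValue() / total;   // assumed normalization to [0, 1]
            if (score >= beta && (best == null || score > bestScore)) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        AttractionSketch model = new AttractionSketch();
        model.observe(Arrays.asList("a1", "a2", "a3"));
        model.observe(Arrays.asList("a1", "a2", "a3"));
        model.observe(Arrays.asList("a1", "n1", "a2", "a3")); // session with one noise action
        System.out.println(model.next("a1", 0.01)); // prints Optional[a2]
        System.out.println(model.next("a3", 0.01)); // no successor was observed: Optional.empty
    }
}
```

Because the attractions accumulate across sessions, a noise action that appears only occasionally between two pattern actions weakens their link without replacing it, which matches the intuition behind the accumulated attraction strength described above.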
Three types of automatically generated datasets, each with a different type of data defect, are used to test the algorithm. The results show that the proposed algorithm is effective in learning and mining sequence patterns from disordered, noisy and incomplete session data. Prediction tests are used to measure the learning ability of the algorithm, and the average prediction accuracy stays above 90%. Finally, a heterogeneous dataset is generated by combining all the data defects to simulate a real data environment, and the pattern discovery ability of the algorithm is measured under different conditions. The experimental results show that the algorithm can discover an unknown intrusion pattern in a dataset containing 50 other intrusion patterns that share 40% of their actions with the target pattern session, with the discovery accuracy staying above 91%.
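To make the three data defects concrete, the following Java sketch injects missing, noise and disordered actions into a clean session. It is an assumption-laden illustration rather than the generator used in the experiments: the class DefectInjector, its method names, the interpretation of each rate as a fraction of the original session length, and the way positions are chosen are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that injects the three data defects into a clean action session. */
final class DefectInjector {

    /** Removes round(rate * n) randomly chosen actions (incomplete data); rate is assumed in [0, 1]. */
    static List<String> dropMissing(List<String> session, double rate, Random rnd) {
        List<String> out = new ArrayList<>(session);
        int k = (int) Math.round(rate * session.size());
        for (int i = 0; i < k; i++) {
            out.remove(rnd.nextInt(out.size()));
        }
        return out;
    }

    /** Inserts round(rate * n) actions drawn from noisePool at random positions (noise data). */
    static List<String> addNoise(List<String> session, double rate,
                                 List<String> noisePool, Random rnd) {
        List<String> out = new ArrayList<>(session);
        int k = (int) Math.round(rate * session.size());
        for (int i = 0; i < k; i++) {
            String noise = noisePool.get(rnd.nextInt(noisePool.size()));
            out.add(rnd.nextInt(out.size() + 1), noise);
        }
        return out;
    }

    /** Shuffles the actions found at round(rate * n) randomly chosen positions (disordered data). */
    static List<String> disorder(List<String> session, double rate, Random rnd) {
        List<String> out = new ArrayList<>(session);
        int k = (int) Math.round(rate * session.size());
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < out.size(); i++) {
            positions.add(i);
        }
        Collections.shuffle(positions, rnd);
        List<Integer> chosen = new ArrayList<>(positions.subList(0, k));
        List<String> values = new ArrayList<>();
        for (int p : chosen) {
            values.add(out.get(p));
        }
        // A shuffle may leave some values in place, so at most k actions actually move.
        Collections.shuffle(values, rnd);
        for (int i = 0; i < chosen.size(); i++) {
            out.set(chosen.get(i), values.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> clean = Arrays.asList("a1", "a2", "a3", "a4", "a5",
                                           "a6", "a7", "a8", "a9", "a10");
        Random rnd = new Random(7); // fixed seed so that runs are repeatable
        List<String> session = disorder(clean, 0.2, rnd);
        session = addNoise(session, 0.5, Arrays.asList("n1", "n2", "n3"), rnd);
        session = dropMissing(session, 0.2, rnd);
        System.out.println(session);
    }
}
```

Combining the three calls on the same session, as in main, is one possible way to build a heterogeneous session that carries all defects at once.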
Although the algorithm is effective in mining intrusion patterns from complex datasets, it is still affected by highly disordered sessions: if 50% of the actions in each session randomly change position, the accuracy of the algorithm drops to 85%. Solving this problem will be the aim of subsequent studies.
The proposed algorithm is based on online unsupervised learning and requires no complex parameter tuning or retraining. In addition, network log normalization, aggregation and other clustering processes are recommended to prevent explosive growth in the number of action types.