Two-Phase PFAC Algorithm for Multiple Patterns Matching on CUDA GPUs

The rapid advancement of high-speed networks has significantly increased the number of network packets per second, so network intrusion detection systems (NIDSs) must accelerate the inspection of packet content to protect computer systems from attacks. On average, the pattern matching process in a NIDS consumes approximately 70% of the overall processing time. The conventional Aho-Corasick (AC) algorithm, which uses a finite state machine to identify attack patterns in NIDSs, is too slow to meet the requirements of high-speed networks. In view of this, several studies have used the features of a graphics processing unit (GPU) to improve the core searching process of the AC algorithm. For instance, the parallel failureless Aho-Corasick (PFAC) algorithm improves pattern matching effectively by removing the backward branches from the finite state machine created by the AC algorithm. Boundary detection can then be avoided entirely by allocating an individual thread to each byte of an input stream to identify any pattern starting at that thread's starting position. However, our analysis shows that this algorithm suffers from a serious load imbalance problem. Therefore, this paper proposes a two-phase PFAC algorithm to address the problem. A predefined threshold divides execution into two phases, and the failureless finite state machine is decoupled into two parts accordingly. In the first phase, every thread identifies patterns by running the tiny part of the decoupled failureless finite state machine that is stored in fast shared memory. In the second phase, all the threads in the same block that require further searching are regrouped into a few warps to reduce branch divergence. According to experimental results, the proposed algorithm shows a performance improvement of 50% compared to the PFAC algorithm.


Introduction
A network intrusion detection system (NIDS) [1] is widely used to protect computer systems from hidden attacks such as denial of service, malware, and port scans. When analyzing network traffic or host logins, it must compare thousands of attack patterns against continuously incoming network packets to filter out possible attacks. Because the number of packets arriving per second keeps increasing, conventional string matching algorithms cannot analyze such an enormous amount of network traffic efficiently enough to meet the demand of high-speed data transmission. In this case, network congestion may occur or a large amount of data may be discarded, which can increase security threats or degrade the quality of network service. Worse still, the advancement of high-speed networks has also resulted in an increasing number of patterns in NIDSs. For instance, the pattern rule set typically uses Snort rules, which currently contain more than 16,000 rules and are updated continuously [2,3]. Moreover, during pattern matching, one or more attack patterns can be discovered in a single stream.
The process of pattern matching has a significant impact on NIDS performance. On average, the pattern matching process in a NIDS [1,2,4] consumes approximately 70% of the overall processing time. Therefore, considerable research has been conducted on improving the pattern matching process. Previous studies proposed various parallel approaches to accelerate this time-consuming process [5][6][7]. Such approaches utilize the features of different architectures such as distributed computing systems, multi-core CPUs, and graphics processing units (GPUs). Modern general-purpose GPUs are not only powerful graphics engines but also highly parallel programmable processors. Today's GPUs use thousands of parallel processor cores executing tens of thousands of parallel threads to rapidly solve large problems in various application domains.
The Aho-Corasick (AC) algorithm [8,9] is adopted by most NIDSs. It uses failure transitions to backtrack a state machine so that it can recognize a prefix string of other patterns. The parallel failureless AC (PFAC) algorithm extends the AC algorithm and is implemented specifically on a GPU [10]. In the PFAC algorithm, all backward state transitions in the finite state machine (FSM) of the AC algorithm are eliminated, and the derived machine is called a failureless FSM. Furthermore, each byte in an input stream serves as the starting byte for one thread that matches patterns using the failureless FSM. As a result, threads are independent of each other and boundary detection is avoided entirely.
Although the PFAC algorithm outperforms the conventional AC algorithm substantially, it suffers from a severe load imbalance problem. Because a GPU adopts the single-instruction, multiple-thread (SIMT) parallel programming model, if one thread requires a longer execution time for its job, all the other threads in the same warp must wait for its completion even if they have finished their work. In the PFAC algorithm, the number of bytes to be analyzed before a decision is made varies widely among threads, so more and more threads become idle while one or a few threads continue matching.
In this study, we propose a two-phase approach for pattern matching. Our approach is similar to the PFAC algorithm in that each byte in an input stream is the starting character for a thread. In the first phase, the maximum number of bytes to be compared is set equal to a threshold value. Threads that have not failed matching in the failureless FSM at the end of the first phase proceed to the second phase. Moreover, we extract a tiny part of the original failureless FSM and store it in shared memory for high-speed matching in the first phase. In the second phase, the matchings to be continued in the same block are merged into a few warps to reduce divergent branches. Branch divergence, in which different threads follow different control flow paths, is the main performance killer on a GPU; in the worst case, thread execution is serialized. By merging active threads into a few warps, the system performance can be improved significantly. Experimental results demonstrate that our two-phase algorithm achieves a performance improvement of 50% compared to the PFAC algorithm.
The rest of this paper is organized as follows: Section 2 presents related work on pattern matching, Section 3 describes the proposed method, Section 4 presents the experimental results, and Section 5 provides the conclusions and scope for future work.

Related Work
GPUs have been widely used in many fields to accelerate calculation-intensive problems, such as range queries and nearest neighbor queries [11][12][13][14][15]. A range query finds, in a large collection of objects stored in a metric database, all the objects that lie within a definite distance of the query object [11,12], while a nearest neighbor query finds the k objects nearest to a given point in space [13][14][15]. In addition, one research direction is to utilize highly parallel computing resources more efficiently by splitting large computation-intensive problems into thousands, or even more, of smaller subproblems that can be solved in a batch [16][17][18]. These research projects address the following problem: the resource requirements of parallel tridiagonal linear systems might exceed the limits of the GPU system, e.g., the size of shared memory and the maximum number of threads per block. For instance, cuThomasBatch and cuThomasVBatch are two CUDA (Compute Unified Device Architecture) routines that compute batches of tridiagonal linear systems, where cuThomasBatch handles fixed batches, in which all systems share the same size, and cuThomasVBatch handles variable batches [16]. The same idea also inspired the development of cuHinesBatch [17] and BBLAS (Batched Basic Linear Algebra Subprograms) [18]. cuHinesBatch simulates the behavior of the human brain on CUDA GPUs with the Hines algorithm, the standard algorithm used to compute the voltage on neurons' morphology [17]. BBLAS performs BLAS operations in parallel on many small matrices. A recent study proposed a novel data layout that boosts BBLAS performance on CUDA GPUs by interleaving the matrices in memory for better data vectorization and prefetching [18].
String matching (also called string searching or pattern matching) algorithms find one or several substrings (or patterns) in a large string [19,20]. If only one substring is searched for, the algorithm belongs to single-pattern matching; otherwise, it belongs to multiple-pattern matching. The substrings or patterns to be searched may or may not be predefined, which divides pattern matching into exact and inexact matching categories. For instance, virus and intrusion patterns are usually predefined and stored in a database for matching [21,22]. In bioinformatics, however, the LCS (longest common subsequence) algorithm is used to find the longest subsequence common to two biological sequences, and the subsequence need not occupy consecutive positions within the original sequences [22]. The Smith-Waterman algorithm is used to calculate the similarity of two biological sequences, where mutation is taken into account when performing sequence alignment [23].
The main application of the LCS algorithm is in bioinformatics, where the similarity of two large DNA sequences is calculated. In addition, the LCS algorithm has also been applied to other fields, such as intrusion and virus detection [24]. The target sequences of the LCS algorithm in bioinformatics and in other fields have different characteristics. For instance, biological sequences are large and similar in size, whereas virus sequences are short and unbalanced in size. Most studies focused on parallelizing LCS for bioinformatics, whereas the hybrid-LCS (hLCS) algorithm was proposed to parallelize the LCS algorithm for other fields. hLCS adopts a two-phase approach combining the advantages of the row-based and the antidiagonal-based parallelization methods. In this way, synchronization overhead between threads can be alleviated significantly. Moreover, the computing power of the CPU and the GPU is used together to accelerate the LCS algorithm for processing multiple short and unbalanced sequences. Although hLCS can also be applied to network intrusion detection systems [24], it performs inexact pattern matching, which differs from the PFAC-based algorithms that match patterns exactly [10], including the two-phase PFAC algorithm proposed in this paper.
Pattern matching can be divided into two categories, i.e., single and multiple. The algorithms used for processing single patterns include the brute force [9,15], Rabin-Karp [25], Knuth-Morris-Pratt [26], Boyer-Moore [27], two-way string matching [28], and bit-parallel algorithms [29][30][31][32]. On the other hand, the most widely used algorithm for processing multiple patterns is the AC algorithm. In addition, the Commentz-Walter [33] and Wu-Manber (WM) algorithms [34,35] are used to process multiple patterns. The AC algorithm and parallel pattern matching algorithms are explained in the subsequent paragraphs.
The AC algorithm [9] belongs to the class of algorithms used to match multiple patterns, and it can be used in systems with a limited group of patterns, such as a NIDS. This algorithm is frequently used in the analysis of biological sequences, in image processing, and in other intrusion detection systems. It uses failure transitions to backtrack a state machine so that it can recognize a prefix string of other patterns. When the length of an input stream is n, its total execution time complexity is O(n). When the number of patterns increases, many approaches that match sequentially cannot deliver high matching throughput under considerable network traffic. Therefore, a number of algorithms use a data-parallel approach to improve the AC algorithm, which is referred to as the data-parallel AC (DPAC) approach [10]. The DPAC approach divides an input stream into multiple chunks and assigns an individual thread to each chunk to perform the AC algorithm. However, it encounters the boundary detection problem, i.e., it cannot detect a pattern that straddles the boundary of adjacent chunks. To address the problem, every thread needs to scan across the boundary of its assigned chunk. The additional length to be scanned across the boundary equals the longest pattern length minus one.
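The overlap rule can be sketched as follows. This is a minimal Python illustration and not the DPAC implementation of [10]: `chunk_ranges` and `scan_chunks` are hypothetical helpers, and a naive substring check stands in for the AC state machine that each thread would actually run.

```python
def chunk_ranges(n, num_chunks, max_pattern_len):
    """Return one (start, end) scan range per chunk.

    Each chunk owns `size` bytes but scans max_pattern_len - 1 extra
    bytes so that a pattern straddling the boundary is still detected.
    """
    size = -(-n // num_chunks)          # ceiling division
    overlap = max_pattern_len - 1
    return [(start, min(n, start + size + overlap))
            for start in range(0, n, size)]


def scan_chunks(text, patterns, ranges):
    """Stand-in for the per-thread AC scan (naive check, illustration only).

    A match is reported only if it lies completely inside the scan range.
    A set is used because the overlap lets two chunks see the same match.
    """
    hits = set()
    for start, end in ranges:
        for pos in range(start, end):
            for p in patterns:
                if pos + len(p) <= end and text.startswith(p, pos):
                    hits.add((pos, p))
    return hits


text = "cchangicherscte"
patterns = ["he", "hers", "his", "she"]
longest = max(len(p) for p in patterns)

# Three chunks of five bytes each, scanned with and without the overlap.
with_overlap = scan_chunks(text, patterns, chunk_ranges(len(text), 3, longest))
no_overlap = scan_chunks(text, patterns, chunk_ranges(len(text), 3, 1))
```

In this toy run, "hers" starts at position 8 and crosses the boundary between the second and third chunks, so it is found only when the extra (longest pattern length minus one) bytes are scanned.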
To solve the boundary detection problem, the parallel failureless AC (PFAC) algorithm [10] was proposed by Lin et al. in 2013. This algorithm improved the AC algorithm on a GPU without the need for boundary detection. The main idea of the PFAC algorithm is to remove all the failure transitions and self-loop transitions that backtrack the state machine in the original design of the AC algorithm. Furthermore, an individual thread is allocated to each byte of an input stream to identify any pattern starting at the thread's starting position. The total number of threads created by PFAC is therefore equal to the length of the input stream. Whenever a thread cannot match any further byte, it terminates immediately. Although PFAC creates a huge number of threads, most threads are likely to terminate very early because its finite state machine has no failure transitions.
Another idea proposed in the PFAC algorithm is that each final state represents a unique pattern, so multiple outputs need not be handled. The state number of a final state is equal to the id of the matched pattern. That is, if there are m patterns, m final states are created and numbered from 1 to m. In addition, the initial state is numbered (m+1), and all other internal states are numbered from (m+2). Therefore, the output table can be eliminated by reordering the state numbers. When a thread encounters a state whose number is less than or equal to m, it has reached a final state and the corresponding pattern is matched. Furthermore, PFAC reports a matched pattern at its starting position in the input stream. If a pattern is a prefix of another longer pattern, both are reported at the same position.
We give an example to explain the PFAC algorithm in more detail. Assume there are four patterns: "he," "hers," "his," and "she." The PFAC algorithm constructs the corresponding finite state machine based on the four patterns, as depicted in Figure 1. There are four final states, States 1, 2, 3, and 4, which correspond to the four patterns, respectively. The initial state is State 5. The finite state machine has no failure transitions. By contrast, in the traditional AC algorithm, there exists a failure transition from State 10 to State 9 when the next byte is "s," and a failure transition from State 10 to State 8 when the next byte is "i."
Next, let us consider an input stream, "cchangicherscte". Because the length of the input stream is 15, a total of 15 threads are created and each byte is assigned to one thread. For instance, Thread 0 scans the input stream from the first byte, "c," and Thread 8 scans from the ninth byte, "h." The initial state is State 5. Thread 0 terminates immediately because State 5 has only two transitions, whose next bytes are "h" and "s", respectively. On the other hand, Thread 8 proceeds to State 6 because its starting byte is "h." After scanning the following three bytes, "ers," Thread 8 walks through States 1, 7, and 2, where States 1 and 2 are final states.
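The walk-through above can be reproduced with a small Python emulation. It is a sketch under stated assumptions rather than the GPU code of [10]: the function names are hypothetical, threads are emulated by a sequential loop, and internal states are numbered in creation order, which may differ from Figure 1. Only the numbering of final states (1 to m) and of the initial state (m+1) is taken from the text.

```python
def build_failureless_fsm(patterns):
    """Build a trie with PFAC-style state numbers and no failure transitions.

    Final states are 1..m (state k means pattern k matched) and the initial
    state is m + 1. Internal states are numbered from m + 2 in creation
    order (an assumption). Patterns are assumed to be distinct.
    """
    m = len(patterns)
    initial = m + 1
    final_of = {p: i + 1 for i, p in enumerate(patterns)}
    state_of = {"": initial}            # prefix string -> state number
    next_internal = m + 2
    delta = {}                          # (state, byte) -> next state
    for p in patterns:
        for k in range(1, len(p) + 1):
            prefix = p[:k]
            if prefix not in state_of:
                if prefix in final_of:
                    state_of[prefix] = final_of[prefix]
                else:
                    state_of[prefix] = next_internal
                    next_internal += 1
            delta[(state_of[prefix[:-1]], prefix[-1])] = state_of[prefix]
    return delta, initial, m


def pfac_thread(delta, initial, m, text, start):
    """Emulate one PFAC thread that starts matching at byte `start`."""
    state, matched = initial, []
    for pos in range(start, len(text)):
        state = delta.get((state, text[pos]))
        if state is None:               # no valid transition: terminate
            break
        if state <= m:                  # final state: pattern id == state number
            matched.append(state)
    return matched


patterns = ["he", "hers", "his", "she"]
delta, initial, m = build_failureless_fsm(patterns)
text = "cchangicherscte"
# One emulated thread per input byte.
results = {t: pfac_thread(delta, initial, m, text, t) for t in range(len(text))}
```

Here `results[8]` contains the final states 1 and 2, i.e., "he" and "hers" are both reported at starting position 8, while every other emulated thread terminates without a match.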
In addition to the AC algorithm, there are other types of algorithms for multiple-pattern matching. Although the bit-parallel algorithm only matches a single pattern, Prasad et al. [36] extended it to search for multiple patterns of the same length in a text simultaneously. Kusudo et al. [31] further extended the bit-parallel algorithm to match multiple patterns of varied lengths. They proposed a data padding scheme that regularizes both control flow and data structure. AVX instructions are used to increase search throughput per CPU core, and OpenMP directives are employed to realize data-parallel search of strings. In 2017, Mitani et al. [32] proposed exact and approximate string matching algorithms that take advantage of scan-based parallelization, segmentation-based parallelization, and bit-level parallelization. They interpreted bit-parallel algorithms as inclusive-scan operations that not only eliminate duplicate searches between threads but also realize an efficient memory access pattern.

Two-Phase PFAC Algorithm
The advantage of PFAC is that it avoids boundary detection by removing the backward state transitions of the traditional AC algorithm. However, the PFAC algorithm still experiences load imbalance because attack patterns have different lengths and every thread begins its pattern matching from a unique position in the input stream. If the pattern being matched by one thread is long, the other threads in the same warp gradually become idle after finishing their matching, resulting in severe branch divergence and low warp utilization, which together degrade system performance significantly. To address the problem, this section introduces the two-phase PFAC algorithm, which increases the utilization of GPU resources and reduces the overall execution time.

Primary Idea
The performance issues of the PFAC algorithm are as follows. (1) Although PFAC creates a huge number of threads, most threads are likely to terminate very early because its finite state machine has no failure transitions, resulting in severe branch divergence and low GPU hardware utilization. (2) The state machine is too large to fit into the faster shared memory of a CUDA GPU. Instead, the finite state machine is stored in either the global memory or the texture memory, resulting in high-latency memory accesses and slow state transitions.
We analyze the first performance issue in more detail by running the PFAC algorithm with the rule set of Snort version 2.9. The first experiment uses different input sizes to observe how many threads terminate after a certain number of state transitions, as shown in Table 1. Approximately 60% of the threads were idle after the first matching when the input size was less than or equal to 32 MB, and more than 70% of the threads became idle when the input size was larger. After one more matching, 82.26% to 96.31% of the threads had terminated because of failed matching. Finally, in all cases less than 1% of the threads required more than five matchings, indicating that only a small number of threads were active after five matchings. Furthermore, the active threads were distributed among different warps, and each warp had only a small number of active threads. Table 2 shows the ratios of warps with fewer than 3 active threads in execution after five matchings. Note that each warp contains 32 threads. More than 97% of the warps had one or two active threads that required further execution after five matchings, meaning that the utilization of each individual warp is quite low.
Because more than 97% of the warps have fewer than 3 threads in execution after five matchings, the PFAC algorithm has a severe load imbalance problem. In other words, more and more threads become idle, due to mismatches, as the others proceed to the next states. Since a CUDA GPU adopts a single-instruction, multiple-thread (SIMT) parallel execution model and a streaming multiprocessor dynamically schedules one warp for execution at a time, all the threads in a warp perform the same instruction on different data during each cycle. If a thread fails the condition of an if-statement without an associated else-statement, it becomes idle while any active threads execute the code in the body of the if-statement. Moreover, idle threads have to wait for the active threads in the same warp until all the threads meet at the next convergence point. Consequently, the execution time of a warp is equal to the execution time of the last thread to complete its task, and it is independent of the execution times of other warps. The total execution time of a thread block is equal to the sum of the execution times of all the warps in the block. We use the example shown in Figure 2 to explain the load imbalance in the PFAC algorithm, which arises because the threads in the same warp perform different numbers of letter matchings. In this example, there are two warps and each warp consists of 32 threads. Each warp has only one thread that requires further matching after five letter matchings, which is the common case implied by Tables 1 and 2.
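The timing model stated above (a warp takes as long as its slowest thread, and a block takes the sum of its warp times) can be written down directly. The following Python sketch uses hypothetical function names, and the per-thread step counts are made-up illustrative values, not measurements from Tables 1 and 2.

```python
WARP_SIZE = 32


def warp_times(steps_per_thread, warp_size=WARP_SIZE):
    """Time of each warp = number of steps taken by its slowest thread."""
    return [max(steps_per_thread[i:i + warp_size])
            for i in range(0, len(steps_per_thread), warp_size)]


def block_time(steps_per_thread, warp_size=WARP_SIZE):
    """Block time = sum of the warp times, as stated in the text."""
    return sum(warp_times(steps_per_thread, warp_size))


def utilization(steps_per_thread, warp_size=WARP_SIZE):
    """Fraction of issued thread slots that perform useful matching steps."""
    issued = sum(t * warp_size for t in warp_times(steps_per_thread, warp_size))
    return sum(steps_per_thread) / issued


# Two warps of 32 threads. Almost every thread stops after one step;
# one thread per warp keeps matching (illustrative numbers only).
steps = [1] * 64
steps[1] = 12    # a long matching in Warp 0
steps[62] = 9    # a long matching in Warp 1

times = warp_times(steps)      # [12, 9]
total = block_time(steps)      # 21
used = utilization(steps)      # 83 useful steps out of 672 issued slots
```

With these numbers, a single long matching per warp determines the whole warp time, and only about 12% of the issued thread slots do useful work.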
The number of idle threads increases whenever one more state transition is made. As a result, the utilization of each warp decreases as we proceed to subsequent matchings. Since the warps are executed in an interleaved manner, the total execution time for these two warps is equal to (T1 + T2 + T3 + T4).
To minimize the total execution time of a thread block, it is best to pack matchings of similar workloads into as few warps as possible. However, the workload of each matching cannot be known before actual execution. Therefore, we propose a two-phase PFAC algorithm to address this problem. In the first phase, a thread proceeds to the next states one by one, making at most Y state transitions if no mismatch occurs, where Y is a threshold that can be predefined by users. The value of Y should be very small according to the experimental results shown in Tables 1 and 2.
At the end of the first phase, all the active threads, which are distributed among different warps in the same block, are merged into a few warps and perform further pattern matching in the second phase. The number of active threads at the end of the first phase can be expected to be much smaller than the total number of threads in the block. Consequently, only very few warps are actually executed in the second phase, and each of these merged warps except the last one has 32 threads at the beginning of the second phase. That is, the two-phase PFAC algorithm increases warp utilization and lets most warps become idle as soon as possible, resulting in a reduced execution time. Therefore, the problem of branch divergence can be alleviated and better system performance can be obtained.
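One common way to merge scattered active threads into dense warps is stream compaction with an exclusive prefix sum over per-thread flags. The Python sketch below illustrates that idea only; it is an assumption made for illustration, not necessarily the merging procedure used by the authors, and `regroup`, `warps_needed`, and `active_warps_before` are hypothetical names.

```python
from itertools import accumulate

WARP_SIZE = 32


def regroup(incomplete):
    """Assign each still-active thread a dense slot via an exclusive scan.

    incomplete[i] is 1 if thread i needs the second phase, else 0.
    Entry k of the result is the original thread id whose remaining
    matching is taken over by thread k of the block in the second phase.
    """
    offsets = [0] + list(accumulate(incomplete))[:-1]   # exclusive prefix sum
    slots = [None] * sum(incomplete)
    for tid, flag in enumerate(incomplete):
        if flag:
            slots[offsets[tid]] = tid
    return slots


def warps_needed(num_active, warp_size=WARP_SIZE):
    """Number of warps occupied once active threads are packed densely."""
    return -(-num_active // warp_size)                  # ceiling division


def active_warps_before(incomplete, warp_size=WARP_SIZE):
    """Number of warps that contain at least one active thread."""
    return sum(1 for i in range(0, len(incomplete), warp_size)
               if any(incomplete[i:i + warp_size]))


# Two warps, each with a single thread that still needs matching.
incomplete = [0] * 64
incomplete[1] = 1
incomplete[62] = 1

slots = regroup(incomplete)                   # [1, 62]
before = active_warps_before(incomplete)      # 2 warps stay busy without merging
after = warps_needed(len(slots))              # 1 warp after merging
```

In this toy block, the remaining work of original threads 1 and 62 is taken over by the first two threads of Warp 0, so the second warp has nothing left to execute.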
We use the example shown in Figure 3, the same as that shown in Figure 2, to illustrate the main idea of our proposed two-phase PFAC algorithm. The threshold is set to five based on the experimental results shown in Tables 1 and 2. In the first phase, the pattern matchings are executed in the same way as in the PFAC algorithm. However, at most five letters are matched in this phase. Warp 0 and Warp 1 require execution times of T1 and T3, respectively. Before proceeding to the second phase, Thread t1 in Warp 0 and Thread t62 in Warp 1 are merged into Warp 0. In the second phase, these two matchings are executed in parallel by the first two threads in Warp 0 and the required execution time is the minimum of T2 and T4. As a result, the total execution time of the two-phase PFAC algorithm is (T1 + T3 + min(T2, T4) + Tm), instead of the (T1 + T2 + T3 + T4) required by the PFAC algorithm, where Tm represents the overhead of merging threads inside each block. The execution time difference between the two-phase PFAC algorithm and the PFAC algorithm increases when there exists one longer matching, and it becomes larger as the matched pattern becomes longer.

Implementation
We use an example with six patterns, as shown in Figure 4, to detail how to implement the proposed two-phase PFAC algorithm on a CUDA GPU. Based on the given six patterns, the corresponding PFAC state machine is constructed and illustrated in Figure 4. Because the first and second patterns have the same prefix string of three letters, these two patterns share the same state transitions for the first three letters. In other words, an input stream with a prefix of "cha" leads to State 10 no matter what the suffix of the input stream is. However, because the fourth letters of these two patterns are different, there are two transitions leaving State 10. If the fourth letter is "n", the state machine transits to State 11, whereas "r" leads to State 14.
Traditionally, one table is used to record all the state transitions of a state machine, and the table is so large that it cannot fit into the small but faster shared memory on a CUDA GPU. Consequently, the state transition table is stored in either the global memory or the texture memory. To address the problem, we divide the state transition table into two tables, the Prefix PFAC Table and the Suffix PFAC Table. The first table should be small enough to be stored in the faster shared memory, while the second table is kept in the global memory or the texture memory.
The Prefix PFAC Table, as shown in Table 3, records the state transitions only for the first two letters of the six patterns. This table is constructed as a 128*128 two-dimensional array, and each dimension is indexed by a letter. Each cell contains an integer representing the next state. State 0 is reserved to represent a trap state, which indicates a mismatch and causes the responsible thread to terminate immediately. For example, if the first two letters in the input stream are "ch," we find the next state by looking up the table slot whose column index and row index correspond to the ASCII codes of "c" and "h", respectively. Therefore, the next state for the first two letters, "ch," is State 9. Similarly, if the first two letters are "fo," the next state is State 20. Because the maximum state number is less than 65,536, each state number can be encoded with two bytes. Since the Prefix PFAC Table has 128*128 entries of two bytes each, its total memory size is 32 KB, which fits into the much faster shared memory, whose size is 48 KB in Tesla K20 and GTX TITAN X.
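The table layout and its size can be checked with a short Python sketch. It is illustrative only: `build_prefix_table` and `lookup` are hypothetical helpers, the table is flattened into a one-dimensional array of 16-bit cells, and only the two entries quoted in the text ("ch" to State 9 and "fo" to State 20) are filled in, because the full contents of Table 3 are not reproduced here.

```python
from array import array

ALPHABET = 128     # 7-bit ASCII, matching the 128*128 table
TRAP_STATE = 0     # reserved state number that signals a mismatch


def build_prefix_table(prefix_states):
    """Build a flattened 128*128 table of 2-byte state numbers.

    prefix_states maps a two-letter prefix to the state reached after it.
    Every other cell holds TRAP_STATE.
    """
    table = array("H", [TRAP_STATE]) * (ALPHABET * ALPHABET)   # unsigned 16-bit
    for prefix, state in prefix_states.items():
        first, second = ord(prefix[0]), ord(prefix[1])
        # Column index = first letter, row index = second letter.
        table[second * ALPHABET + first] = state
    return table


def lookup(table, first, second):
    """Next state after reading the two letters `first` and `second`."""
    return table[ord(second) * ALPHABET + ord(first)]


table = build_prefix_table({"ch": 9, "fo": 20})
size_bytes = len(table) * table.itemsize     # 128 * 128 * 2 = 32768 bytes
```

The computed size of 32,768 bytes (32 KB) matches the figure given in the text and stays below the 48 KB of shared memory mentioned there.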
Figure 5 shows the initial state traversal algorithm of the first phase in the two-phase PFAC algorithm. We assign each letter in the input stream to a thread as its starting letter for matching. In other words, the i-th letter in the input stream is assigned to the thread whose global thread id equals i. For example, if the input size is 32 MB, the total number of threads required is 32*1024*1024. We can then set the number of blocks to 32,768, recorded in the system variable GridDim.x, and the number of threads per block to 1024, recorded in the system variable BlockDim.x. The thread id, ThreadIdx.x, in each block ranges from 0 to 1023, and the block id, BlockIdx.x, ranges from 0 to 32,767. The global id, global_id, of a thread is obtained by the following formula: global_id = ThreadIdx.x + BlockIdx.x * BlockDim.x.
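As a quick check of the arithmetic above, the following Python sketch computes the launch shape and the global id. The function names are hypothetical, and rounding the block count up for inputs that are not a multiple of the block size is an assumption not stated in the text.

```python
THREADS_PER_BLOCK = 1024


def launch_shape(input_bytes, threads_per_block=THREADS_PER_BLOCK):
    """Return (number of blocks, threads per block) for one thread per byte."""
    # Round up so that a partial last block is still covered (assumption).
    blocks = (input_bytes + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_id(thread_idx, block_idx, block_dim):
    """global_id = ThreadIdx.x + BlockIdx.x * BlockDim.x"""
    return thread_idx + block_idx * block_dim


blocks, threads = launch_shape(32 * 1024 * 1024)
print(blocks, threads)                    # 32768 1024
print(global_id(0, 0, threads))           # 0, the first letter
print(global_id(1023, 32767, threads))    # 33554431, the last letter
```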
The index of the first letter to be matched by a thread with an id of ThreadIdx.x is equal to global_id. To resolve the load imbalance problem, we use a threshold to limit the maximum number of letters matched in the first phase. Because, for an input size of 32 MB, less than 1% of the threads require further matching after five matchings, as shown in Table 1, the threshold is set to 5.
We use an Input array to represent the input stream, a local variable strIndex to represent the index of the current letter in Input being processed by a thread, and next_state to record the next state. The initial value of strIndex is the index of the starting position in the input stream for the corresponding thread. The initial value of the current state is INITIAL_STATE, whose state number equals the number of patterns plus one. For instance, if the pattern set is Snort V2.9, the value of INITIAL_STATE is 3863 because the total number of patterns is 3862. After scanning its first two assigned letters, a thread obtains the next state from the Prefix_PFAC_Table. If the state number of the next state is less than INITIAL_STATE (and is not the trap state), it is a final state and is recorded in the Output array. Otherwise, the next state is a transition state, and the thread needs further matching with the subsequent letters and the Suffix PFAC Table. The matching continues until the next state is TRAP_STATE, that is, State 0. If next_state is not equal to TRAP_STATE when the thread reaches the threshold, implying that further matching is required in the second phase, we set the corresponding element of the Incomplete array to 1; otherwise, 0 is recorded. The Incomplete array thus records which threads have not completed their matchings and will be processed in the second phase. Figure 6 shows the next states for each thread after reaching the threshold of the two-phase PFAC algorithm, which is set to 5.
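The first phase can be sketched sequentially in Python, with one loop iteration standing in for one GPU thread. This is a minimal illustration under stated assumptions, not the kernel of Figure 5: the three patterns are hypothetical, internal state ids are assigned arbitrarily, every pattern is assumed to have at least two ASCII letters, and the trie itself serves as the Suffix PFAC Table.

```python
TRAP_STATE = 0
THRESHOLD = 5   # value used in the text


def build_machine(patterns):
    """Build a failureless trie: final states are 1..P, the initial state
    is P + 1, and 0 is the trap state (internal ids are arbitrary)."""
    patterns = sorted(set(patterns))   # a prefix pattern sorts before its extensions
    initial = len(patterns) + 1
    trie = {initial: {}}               # state -> {letter: next state}
    next_internal = initial + 1
    for pid, pattern in enumerate(patterns, start=1):
        state = initial
        for pos, letter in enumerate(pattern):
            if letter not in trie[state]:
                if pos == len(pattern) - 1:
                    new_state = pid            # final state of pattern pid
                else:
                    new_state = next_internal  # internal (transition) state
                    next_internal += 1
                trie[state][letter] = new_state
                trie[new_state] = {}
            state = trie[state][letter]
    return trie, initial


def build_prefix_table(trie, initial):
    """128*128 table holding the state reached after two letters."""
    table = [[TRAP_STATE] * 128 for _ in range(128)]
    for first, middle in trie[initial].items():
        for second, state in trie[middle].items():
            table[ord(first)][ord(second)] = state
    return table


def phase_one(text, prefix, suffix, initial, threshold=THRESHOLD):
    n = len(text)
    matches = []                       # (start index, pattern id)
    incomplete = [0] * n
    state_after = [TRAP_STATE] * n
    for gid in range(n - 1):           # the last letter has no second letter
        state = prefix[ord(text[gid])][ord(text[gid + 1])]
        consumed = 2
        while state != TRAP_STATE:
            if state < initial:        # final state: record the match
                matches.append((gid, state))
            if consumed == threshold or gid + consumed >= n:
                break
            state = suffix[state].get(text[gid + consumed], TRAP_STATE)
            consumed += 1
        if state != TRAP_STATE and consumed == threshold:
            incomplete[gid] = 1        # continue in the second phase
            state_after[gid] = state
    return matches, incomplete, state_after


trie, initial = build_machine(["changing", "chart", "lava"])  # hypothetical patterns
prefix = build_prefix_table(trie, initial)
matches, incomplete, state_after = phase_one("cchanging lava", prefix, trie, initial)
print(matches)                 # [(10, 3)]: "lava" found at index 10
print(incomplete.index(1))     # 1: only the thread starting at index 1 continues
```

In this run the thread starting at index 1 has consumed "chang" when it reaches the threshold, so it is flagged in `incomplete`, while the short pattern "lava" is fully matched within the first phase.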
Assume the input stream is "cchangi…aa…lavaib…" and each letter is assigned to a thread one by one. For instance, Thread t0 scans the input stream from its beginning, i.e., the letter "c," and Thread t2 scans the input stream from its third letter, i.e., the letter "h." Since its first two letters are "cc", Thread t0 reaches the trap state after looking up the Prefix PFAC Table, meaning that no further comparison is needed. Matchings that fail in the first phase are referred to as failure-search matchings. On the other hand, based on the information stored in the Prefix PFAC Table, Thread t1 reaches State 9 because its first two letters are "ch," indicating that a match is possible and further comparison is required. Thread t1 uses its third letter, "a," to look up the Suffix PFAC Table for its next state, which is State 10. Next, Thread t1 continues its matching with the fourth and fifth letters, "ng," to find the next states one by one. After these matchings, Thread t1 reaches State 12 and requires further comparison. Because at most five letters are matched in the first phase of the two-phase PFAC algorithm, we need to record, for Thread t1, the index of the next letter in the input stream and the next state after five matches before proceeding to the next phase. However, we record the thread id instead of the index of the next letter to be matched because the index is easily obtained by adding five to the thread id. Matchings requiring further comparison in the second phase are called continued-search matchings. In the second phase, we only need to process the continued-search matchings. In this example, only the matchings assigned to Threads t1 and t62 belong to this category in the first two warps. In other words, only two threads remain active after five matches, and each warp has only one active thread at the end of the first phase, which incurs severe branch divergence and load imbalance.
Before proceeding to the second phase, we merge the continued-search matchings in the same block into as few warps as possible; we call this merging the job compression procedure. In this case, the two continued-search matchings can be merged into a single warp, leaving the second warp idle. To perform the job compression procedure, we need to record information about the continued-search matchings run by the active threads. An Incomplete array is allocated to indicate whether the corresponding matching is completed or not, as described previously. In addition, the starting position for the subsequent execution of each continued-search matching is stored in the nextLetterIndex array in shared memory.
To perform the job compression procedure, if there are k continued-search matchings in a block, we let the first k threads in the block execute the further matchings in the second phase. We use the first two warps for explanation. We perform a parallel reduction operation on the Incomplete array to obtain the prefix sum [37] for each element, which gives, for each thread, the number of active threads up to and including itself. We use the example shown in Figure 6 to explain the idea, and the results are shown in Tables 3 and 4. For instance, since Thread t1 is the first active thread, its prefix sum is 1, while the result for Thread t0 is 0, as shown in Table 3. Similarly, Thread t62 is the second active thread and its prefix sum is 2. Moreover, the prefix sum for each thread between t1 and t62 is 1. Based on the corresponding prefix sum and Incomplete value, the i-th active thread, whose Incomplete value is 1, writes the index of its next letter and its next state to the i-th elements of the corresponding arrays to be processed in the second phase. For instance, Thread t1 writes its next state, State 12, and its next letter index to the first elements of the two arrays, newStrIndex and newState, which are used by Thread t0 in the second phase, as shown in Table 3. Since the next letter index can be obtained by adding the threshold value to the thread id, i.e., 1 + 5 = 6, we record the thread id instead, as shown in Table 4.
Similarly, Thread t62 writes its next state, State 31, and its thread id, 62, to the second elements of the two arrays, which are used by Thread t1 in the second phase. In this example, only the first two threads in the first warp perform further matching in the second phase because their next states are not equal to State 0, i.e., the trap state. In this way, the continued-search matchings are merged into the first warp and executed by Threads t0 and t1, alleviating the impact of load imbalance and branch divergence. Figure 7 shows the algorithm of the job compression procedure, where the parallel reduction is similar to that in [37].
Figure 8 shows the second phase of the two-phase PFAC algorithm. The second phase is based on the results calculated in Table 4. Each thread checks its newStrIndex value to determine whether it needs to execute further matching in the second phase. If the value is not equal to −1, the thread adds the threshold value to its newStrIndex value to obtain the correct letter index for the subsequent matching. In addition, the thread reads its newState value to access the Suffix PFAC Table and retrieve the next state. For instance, Thread t0 adds 5 to its newStrIndex value, i.e., 1, to obtain the correct letter index 6 for its subsequent matching, as shown in Table 5. In addition, its next state is 12, which is stored in the first element of the newState array, as shown in Table 5. Similarly, Thread t1 adds 5 to its newStrIndex value of 62 to obtain the correct letter index 67 for its subsequent matching, and its next state is 31, which is stored in the second element of the newState array. On the other hand, all the other threads do not execute any matching because their newStrIndex values are equal to −1. As a result, the second warp remains idle during the whole second phase.
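The job compression procedure and the second phase can be sketched together in Python. The prefix sum is computed sequentially here as a stand-in for the parallel scan of [37], and the values for Threads t1 and t62 (states 12 and 31) are taken from the example above. The transitions in `suffix`, the input text, and the final-state numbers 1 and 2 are hypothetical, since the six patterns of Figure 4 are not reproduced in this section.

```python
TRAP_STATE = 0
THRESHOLD = 5


def inclusive_prefix_sum(flags):
    """Sequential stand-in for the parallel scan over the Incomplete array."""
    sums, running = [], 0
    for flag in flags:
        running += flag
        sums.append(running)
    return sums


def compress_jobs(incomplete, state_after):
    """Pack the continued-search matchings into the first k slots."""
    n = len(incomplete)
    new_str_index = [-1] * n           # -1: no job mapped to this thread
    new_state = [TRAP_STATE] * n
    sums = inclusive_prefix_sum(incomplete)
    for tid in range(n):
        if incomplete[tid] == 1:
            slot = sums[tid] - 1       # i-th active thread -> slot i - 1
            new_str_index[slot] = tid  # thread id; letter index is tid + THRESHOLD
            new_state[slot] = state_after[tid]
    return new_str_index, new_state


def phase_two(text, suffix, new_str_index, new_state, initial):
    """Continue each packed matching until the trap state or end of input."""
    matches = []
    for tid, start in enumerate(new_str_index):
        if start == -1:
            continue
        pos = start + THRESHOLD
        state = new_state[tid]
        while pos < len(text):
            state = suffix.get(state, {}).get(text[pos], TRAP_STATE)
            if state == TRAP_STATE:
                break
            if state < initial:        # final state: record the match
                matches.append((start, state))
            pos += 1
    return matches


# Two warps (64 threads): t1 stopped in State 12 and t62 in State 31.
incomplete = [0] * 64
state_after = [TRAP_STATE] * 64
incomplete[1], state_after[1] = 1, 12
incomplete[62], state_after[62] = 1, 31
new_str_index, new_state = compress_jobs(incomplete, state_after)
print(new_str_index[:3], new_state[:3])   # [1, 62, -1] [12, 31, 0]

# Hypothetical suffix transitions and input, for illustration only.
suffix = {12: {"i": 40}, 40: {"n": 41}, 41: {"g": 1}, 31: {"b": 2}}
text = "cchanging" + "." * 53 + "lavaib" + "." * 4
print(phase_two(text, suffix, new_str_index, new_state, initial=7))
# [(1, 1), (62, 2)]
```

After compression only slots 0 and 1 carry work, so in a GPU setting all remaining threads of the block, including the whole second warp, would skip the second phase.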

Performance Evaluations
This experiment was conducted on two workstations. The first workstation consists of an Intel Core i7-4790 CPU (Intel, Santa Clara, CA, USA) with a 64-bit Linux operating system (Ubuntu 12.4) and an NVIDIA Tesla K20 (NVIDIA, Santa Clara, CA, USA), which adopts the Kepler hardware architecture and runs CUDA 5.0. As shown in Table 6, the Intel Core i7-4790 is a multithreaded multicore Intel-architecture processor that features a superscalar microarchitecture. It contains four cores with two-way hyper-threading on a chip that operates at 3.60 GHz. The second workstation consists of an Intel Core i5-6400 CPU (Intel, Santa Clara, CA, USA) with a 64-bit Linux operating system (Ubuntu 14.04) and an NVIDIA GeForce GTX TITAN X (NVIDIA, Santa Clara, CA, USA), which uses the second-generation Maxwell hardware architecture and runs CUDA 7.5. As shown in Table 7, the Intel Core i5-6400 contains four cores with two-way hyper-threading on a chip that operates at 2.70 GHz.
We use four pattern sets for the experiment, as shown in Table 8. The first pattern set, Rule set v1, is Snort v2.9, which consists of 3862 patterns; the other three pattern sets are based on Snort v2.9 but have different pattern-length distributions. Rule sets v2, v3, and v4 contain more long patterns than Rule set v1 because the PFAC study [10] pointed out that the system faces a relatively large load imbalance when the average pattern length of successful matchings is longer. If the proportion of long patterns increases, the two-phase PFAC algorithm is expected to improve the matching of these patterns even more, because a longer average pattern length increases the time that idle threads in a warp must wait.
The throughput comparison between the two-phase PFAC and PFAC algorithms on the two workstations is shown in Figures 11 and 12. The figures show that the throughput of the two-phase PFAC algorithm increases by a factor ranging from 1.3 to 1.5 on the K20 workstation and by a factor ranging from 1.5 to 1.9 on the TITAN X workstation. Throughput is calculated by the following formula:

Throughput = File size (MB) / Execution time (seconds).
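As a small worked example of this formula, the Python sketch below computes throughput for two runs and the resulting improvement factor. The timings are hypothetical values chosen for illustration, not measurements from the experiments, and the function name is an assumption.

```python
def throughput_mb_per_s(file_size_mb, execution_time_s):
    """Throughput = File size (MB) / Execution time (seconds)."""
    return file_size_mb / execution_time_s


# Hypothetical timings for a 32 MB input (not measured values).
pfac = throughput_mb_per_s(32, 0.5)         # 64.0 MB/s
two_phase = throughput_mb_per_s(32, 0.4)
improvement = two_phase / pfac              # ratio of the two throughputs

print(pfac, two_phase, improvement)
```

For a fixed file size the improvement factor reduces to the ratio of the two execution times, so it does not depend on the input size used in the calculation.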
The probability of success in the matching process affects the enhancement in throughput. Because there is considerable disparity between the probabilities of matching success and failure for rule sets v1 and v3, the throughput gains for these rule sets exceed those of the other two rule sets.
Given a set of varying input sizes, the speedup over the PFAC algorithm obtained on the K20 workstation increases from 1.2 to 1.4 with increasing input size, as shown in Figure 9. In contrast, the TITAN X workstation yields speedups ranging from 1.8 to 2, as shown in Figure 10, and the two-phase PFAC algorithm does not slow down as the average pattern length or the input size increases.
Conclusions
The PFAC algorithm previously proposed by Lin et al. improves the performance of parallel multi-pattern matching on a CUDA GPU by introducing the idea of a failureless finite-state machine. Although it effectively addresses the problem of redundant matching between threads, it still suffers from significant load imbalance within a warp because most threads become idle after fewer than five letter matchings. The idle threads have to wait for the last thread in the same warp to finish its work. In this paper, we propose a two-phase approach to resolve the load imbalance caused by the irregular distribution of matching lengths in a warp during parallel pattern matching on a CUDA GPU. In the first phase, a thread matches at most a fixed number of letters, which is determined by a user-defined threshold. The threshold is usually very small and is set to five in our experiment. In addition, the state transitions for the first two letters are extracted from the original next-state table and organized into a small two-dimensional array that is placed in the small but fast shared memory. Typically, the latency of accessing shared memory is only one hundredth of that of accessing global memory. Consequently, the small split table accelerates the lookup of the next state for the first two letters. Only a few threads in a warp are expected to require further matching, which is performed in the second phase of our proposed algorithm. To address the problem of branch divergence in a warp, we rearrange the mapping of tasks to threads in each block. All the tasks in a block requiring the execution of the second phase are mapped to consecutive threads beginning from the first thread. We use the prefix sum technique to remap the active tasks to threads. After the remapping, only a small number of warps need to execute the second phase.
The experimental results show that, compared to the PFAC algorithm, the proposed method increases throughput by approximately 50% when applied to various combinations of pattern sets and input streams. In addition, our approach outperforms PFAC with speedups ranging from 1.2 to 2.0. In the future, we aim to further optimize the proposed approach by improving the executions after the threshold in the first phase. Such an optimization can be expected to improve both the warp usage rate and thread utilization. In addition, if the job compression function can resolve memory bank conflicts, memory access efficiency should also improve.

Figure 2. Load imbalance in the original PFAC algorithm.

Figure 3. The main idea of the proposed two-phase PFAC algorithm.

Figure 4. System-created PFAC machine based on six given patterns.

Figure 5. Phase one of the two-phase PFAC algorithm: Initial state traversal.

Figure 6. The next state for each thread after reaching the threshold, which is set to 5.

Figure 7. Phase one of the two-phase PFAC algorithm: Job compression procedure.

Figure 8. Phase two of the two-phase PFAC algorithm: Remainder traversal.

Figure 9. Comparison of speedup between two-phase PFAC and PFAC algorithms with different rule sets and input sizes on K20 workstation.

Figure 10. Comparison of speedup between two-phase PFAC and PFAC algorithms with different rule sets and input sizes on GTX TITAN X workstation.

Figure 11. Comparison of throughput between two-phase PFAC and PFAC algorithms with different rule sets and input sizes on K20 workstation.

Figure 12. Comparison of throughput between two-phase PFAC and PFAC algorithms with different rule sets and input sizes on GTX TITAN X workstation.
Table 1. Thread termination ratios with different input sizes.

Table 2. Ratios of warps with less than 3 active threads in execution after five matchings.
The Suffix PFAC Table is similar to the state transition table used in the PFAC machine proposed by Lin et al., except that the entries corresponding to the first two letters are eliminated, resulting in a smaller table that is stored in global memory. The Suffix PFAC Table is also a two-dimensional array, but its organization differs from that of the Prefix PFAC Table. Each row in the Suffix PFAC Table corresponds to a state, while each column corresponds to a letter. Each slot contains the next state and is accessed with the current state and the input letter. The Suffix PFAC Table is accessed only when the next state obtained from the Prefix PFAC Table, which is determined by the first two letters in the input stream, is not State 0.

Table 4. The information about the job mapping in the first phase.

Table 5. The information about the job mapping in the second phase.

Table 6. Specifications of host and device machines of K20 workstation.

Table 7. Specifications of host and device machines of TITAN X workstation.

Table 8. Different rule sets.

As shown in Figures 9 and 10, the two-phase PFAC algorithm outperforms the PFAC algorithm on both workstations, where speedup is calculated by the following formula: Speedup = PFAC execution time / Two-phase PFAC execution time.