Accelerating Event Detection with DGCNN and FPGAs

Recently, Deep Neural Networks (DNNs) have been widely used in natural language processing. However, DNNs are often computation-intensive and memory-expensive, which makes them difficult to deploy in the real world. To solve this problem, we propose a network model based on the dilated gated convolutional neural network, which is very hardware-friendly. We further expand the word representations and the depth of the network to improve the performance of the model. We replace the Sigmoid function with a more hardware-friendly alternative without accuracy loss, and we quantize the network weights and activations to compress the network size. We then propose the first FPGA (Field Programmable Gate Array)-based event detection accelerator built on the proposed model. The accelerator significantly reduces latency with a fully pipelined architecture. We implemented the accelerator on the Xilinx XCKU115 FPGA. The experimental results show that our model obtains the highest F1-score of 84.6% on the ACE 2005 corpus. Meanwhile, the accelerator achieved 95.2 giga operations per second (GOP/s) and 13.4 GOPS/W in performance and energy efficiency, which are 17 and 158 times higher, respectively, than those of a Graphics Processing Unit (GPU).


Introduction
With the rapid development of the Internet, it has become more and more important to extract useful structured information from massive amounts of unstructured text. Information Extraction (IE) tasks aim to identify event descriptions (including entities, relationships, and events) in unstructured natural text, to classify them into predefined categories, and to store them in a structured form for users to query and further analyze. Event Detection (ED), which aims to accurately identify event triggers of specific types, is an important and challenging part of IE. For example, a "movement" event triggered by "forced" should be extracted from the following text: "A wildfire in California forced hundreds of people from their homes".
Event detection systems today are commonly based on pattern matching [1][2][3][4] and machine learning (ML) [5][6][7][8]. Pattern matching achieves high performance in a particular domain but offers little portability: whenever the system is ported to a new scenario, new patterns must be built, and tuning patterns is a time-consuming process that requires considerable experience. Machine learning, by contrast, does not require much guidance from domain experts and has better portability. With the increasing abundance of textual resources on the Internet, the corpus is no longer the bottleneck for machine learning, and machine learning has become the main research method for event detection. However, efficiently deploying the resulting computation-intensive models remains challenging.

To address these issues, we present the design of a Chinese event detection accelerator for FPGAs. To reduce computation and memory footprint, we design a novel CNN model based on the multilayer Dilated Gated Convolutional Neural Network (DGCNN). EE-DGCNN [34] demonstrated the potential of the DGCNN for event detection tasks and FPGA implementation. The DGCNN reduces computational complexity while capturing long-term dependencies, owing to its use of dilated convolution. This paper makes the following major contributions:

1. To the best of our knowledge, we are the first to study FPGA acceleration for NLP tasks. We present a CNN model that outperforms previous event detection works on the ACE 2005 Chinese corpus.
2. We improve the computational efficiency of the model on hardware by optimizing the activation function and quantizing the model.
3. We implement our event detection accelerator on an FPGA and show significant improvements over CPU and GPU baselines.
The paper is organized as follows. Section 2 introduces the basic background of event detection and the design directions of the accelerator. Section 3 presents our Chinese event detection model. Section 4 shows our hardware-oriented model optimizing strategy. Section 5 introduces the architecture of the accelerator. Section 6 reports the experimental results, and Section 7 concludes the paper.

Event Detection
The event detection task in this paper is defined as in the Automatic Content Extraction (ACE) [35] evaluations. An event consists of an event trigger and event arguments. The event detection task mainly aims to find event triggers and to categorize events into predefined event types. To help understand the task, we first introduce some event extraction terms.
• Event mention: the description of a specific event, including trigger words and arguments.
• Event type: the specific predefined category of an event.
• Event trigger: keywords or phrases that clearly express the occurrence of an event.
• Event arguments: the participants or attributes of the event.
The ACE 2005 evaluation defines 8 types of events and 33 subtypes. In this paper, we ignore the hierarchical structure and treat the 33 subtypes as 33 distinct event types. Besides, we add a NONE type to the predefined event types for non-trigger words.

Convolutional Neural Networks
Convolutional neural networks generally consist of various network layers such as the convolutional layer, the fully connected layer, and the pooling layer. In particular, convolutional layers are the heart of the convolutional neural network and often account for the main part of its computation. Therefore, simplifying the convolutional computation, or designing an appropriate hardware architecture to fit it, is critical for accelerating convolutional neural networks; methods such as frequency-domain convolution and the Winograd algorithm are widely used. By dimension, convolution can be categorized into one-dimensional (1-D) convolution, two-dimensional (2-D) convolution, and higher-dimensional variants. The 1-D convolution is often used in sequence problems, such as NLP tasks, while the 2-D convolution is usually used in image processing tasks, such as object recognition. The ED task, however, differs: for traditional ED models, researchers tend to use a whole sentence as the input to extract the dependencies between adjacent words. They therefore have to perform convolution operations within and between words at the same time and use a 2-D convolution instead of a 1-D convolution.
For simplicity, we introduce the 2-D convolution in ED tasks with a sentence that contains l words. Assuming each word is represented by a vector of length w, the sentence forms a 2-D matrix of shape [l, w]. Suppose a kernel of shape [k, w] is used for the convolution operation, where k is the height of the kernel. The convolutional window generates a series of feature sets by sliding between words; multiplying each set elementwise with the convolution kernel and summing gives the final result, as shown in Figure 1. We use X for the input sentence, Y for the output, and W for the kernel. Assuming a stride of 1, the results are computed as in Equation (1). When processing a Chinese corpus, each Chinese character has an embedding representation, just like an English word.
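The sliding-window computation described above can be sketched in a few lines of NumPy (the shapes and values below are illustrative, not from the paper):

```python
import numpy as np

def sentence_conv2d(X, W):
    """Slide a [k, w] kernel over an [l, w] sentence matrix with stride 1.

    Each output is the elementwise product of the kernel with the window
    covering words i .. i+k-1, summed to a scalar.
    """
    l, w = X.shape
    k, _ = W.shape
    Y = np.empty(l - k + 1)
    for i in range(l - k + 1):
        Y[i] = np.sum(X[i:i + k, :] * W)
    return Y

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 words, embedding width 3
W = np.ones((2, 3))                           # kernel height k = 2
print(sentence_conv2d(X, W))                  # -> [15. 33. 51.]
```

With l = 4 and k = 2 the window slides three times, so the output has l − k + 1 = 3 features, matching the dimensions discussed above.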

Dilated Convolutional Neural Networks
Dilated convolutions were first introduced into neural networks by Yu and Koltun [36] to resolve the imperfect match between classical convolutional neural networks and the requirements of semantic segmentation. Compared to classical convolution, the dilated convolution adds an extra option for the kernel: the dilation. If the dilation is 1, the convolution is the same as a normal convolution. If the dilation is 2, a zero is inserted between every two elements of the kernel. In general, if the dilation is n, then n − 1 zeroes are inserted between every two elements of the kernel. Figure 2 compares the information extraction capability of normal and dilated convolution with the same kernel size and three convolutional layers. A node in the third layer of the normal convolution can obtain information from at most four nodes in the first layer, while a node in the third layer of the dilated convolution can obtain information from at most eight nodes in the first layer. Moreover, the parameters and computations are exactly the same as those of the normal convolution. There are two main advantages of dilated convolution.

• The receptive field expands as the dilation rate increases.
• The number of computations and parameters does not change with the dilation rate.
Normal convolutional neural networks generally perform worse than recurrent neural networks (e.g., LSTM) in some NLP tasks [37]. A major reason is that the receptive field of the convolutional neural network is limited by the size of the kernel, whereas recurrent neural networks naturally have access to information over long distances. In general, each output feature can only capture information from the input features covered by its kernel; a normal convolutional layer has no way to capture information from long distances, and for information at longer distances the convolutional kernel must be expanded. For 1-D convolution, the number of parameters and computations is linearly related to the kernel size, while for 2-D and higher-dimensional convolution, expanding the kernel makes the amount of computation explode. Dilated convolutional neural networks greatly alleviate this problem: by adjusting the dilation rate, convolutional neural networks gain a perception of long-term dependence similar to recurrent neural networks while keeping their higher parallelizability. Therefore, dilated convolution is more suitable than normal convolution for NLP tasks as well as for hardware acceleration.
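The receptive-field arithmetic behind Figure 2 can be checked with a short script. Kernel size 2 and dilations 1, 2, 4 are assumed here to match the three-layer example; with stride 1, each layer widens the receptive field by (kernel size − 1) × dilation:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked 1-D convolutions with stride 1:
    each layer adds (kernel_size - 1) * dilation input positions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_field(2, [1, 1, 1]))  # normal 3-layer stack  -> 4
print(receptive_field(2, [1, 2, 4]))  # dilated 3-layer stack -> 8
```

This reproduces the four-node versus eight-node comparison from Figure 2, while the parameter count per layer (kernel_size weights) stays the same in both cases.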

Word Representation
Words are usually represented as vectors in NLP tasks. Early researchers encoded words with one-hot vectors, but this approach cannot reflect the inner connections between words and often leads to the curse of dimensionality on a large corpus. Word2vec [38] was proposed to solve these problems by mapping words to a vector space. It greatly reduced the data dimension, and word vectors sharing common contexts are located close to one another in that space. However, it assigns only one representation per word and therefore cannot handle polysemy. Peters et al. [39] proposed Embeddings from Language Models (ELMo), which uses a bidirectional LSTM to obtain the contextual information of each word and generates a contextual representation for every word. OpenAI applied a similar idea and proposed the OpenAI GPT (generative pretrained transformer) [40], using a transformer instead of the bidirectional LSTM that ELMo used. However, OpenAI GPT used unidirectional transformers, so only the preceding context was actually captured. Therefore, Google [41] proposed BERT (Bidirectional Encoder Representations from Transformers). BERT performed remarkably well on SQuAD (The Stanford Question Answering Dataset) 1.1 and achieved the best performance on 11 different NLP tasks. Later, Facebook [42] further optimized it and developed RoBERTa (Robustly optimized BERT approach), which uses a larger model with more training data and outperforms BERT.

EE-DGCNN
EE-DGCNN is one of the best recent models for event extraction. It achieved the best performance on the ACE 2005 corpus with the help of the DGCNN, and its computational complexity is much smaller than that of other models such as a standard CNN or a bidirectional LSTM. EE-DGCNN consists of a 12-layer DGCNN and a fully connected layer. The DGCNN is a structure based on the dilated convolution combined with the gated linear unit and a residual connection. Gehring et al. [43] first proposed this structure and applied it to NLP tasks. Figure 3 shows the basic structure of the DGCNN. It can be calculated by Equation (2), where Y is the output result, X is the input vector sequence, σ is the Sigmoid function, and 1DConv1 and 1DConv2 are two one-dimensional convolutional layers.
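As a rough sketch, one common reading of the dilated gated residual block (Equation (2)) is Y = X ⊗ (1 − g) + 1DConv1(X) ⊗ g with gate g = σ(1DConv2(X)); the paper's exact variant may differ, and the kernel weights below are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w, dilation=1):
    """Same-padded 1-D dilated convolution (odd kernel length assumed)."""
    k = len(w)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

def dgcnn_block(x, w1, w2, dilation=1):
    """One dilated gated conv block with a residual connection:
    Y = X * (1 - g) + Conv1(X) * g, with g = sigmoid(Conv2(X))."""
    g = sigmoid(conv1d_same(x, w2, dilation))
    return x * (1.0 - g) + conv1d_same(x, w1, dilation) * g

x = np.linspace(-1.0, 1.0, 8)
print(dgcnn_block(x, [0.2, 0.5, 0.3], [0.1, -0.2, 0.1], dilation=2))
```

Note how the gate interpolates between the input (residual path) and the convolution output, which is what lets gradients flow through deep stacks of such blocks.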

Model Design
As discussed in Section 2, dilated CNNs can obtain long-term dependencies in NLP tasks while standard CNNs cannot. Although dilated CNNs cause irregular memory accesses during computation, we can avoid this by designing FPGA accelerators with a custom hardware architecture. Inspired by EE-DGCNN, we chose the DGCNN as the core of our event detection network; the performance of EE-DGCNN has demonstrated that the DGCNN has adequate potential for event extraction. However, EE-DGCNN has two major limitations for Chinese event detection.
First, EE-DGCNN was originally designed to extract events in English and lacks sufficient classification ability when migrated to Chinese. We trained the EE-DGCNN model directly on the ACE 2005 corpus; its performance is merely comparable to previous work, with detailed results shown in Table 1. Second, like most neural network models, EE-DGCNN was not designed with hardware in mind. For example, its gated mechanism requires the Sigmoid function as the gate function, but the Sigmoid function consumes huge resources to compute in hardware.
To address these limitations, we modify the network structure based on EE-DGCNN to make it more suitable for the Chinese event detection task and for hardware implementation. The biggest differences compared to EE-DGCNN are the following.
Wider word vector representation. Compared to English, event extraction for Chinese is more difficult, because the English task can capture more additional information at the lexical level. For example, English words are separated by delimiters while Chinese lacks natural delimiters; in English, singular and plural can be told apart by word form and part-of-speech tagging, and English verbs have morphological changes, so verbal nouns are easily distinguished from general verbs. Therefore, we decided to expand the word representation with more linguistic features to improve event detection performance. In the original EE-DGCNN, the authors used BERT to encode the input text. When we used EE-DGCNN to process Chinese, we chose the Chinese BERT model with a 12-layer transformer and 768 hidden units, but the experimental result was not satisfying. We therefore used the Chinese RoBERTa model with 1024 hidden units to encode the Chinese text and retrained EE-DGCNN on it, as shown in Table 2. The results show that using RoBERTa significantly improves the classification performance of the model.
Deeper network. Unlike other CNN models (such as DMCNN), the EE-DGCNN model is a character-wise classification model. Therefore, when we expanded the dimension of the word representation, the receptive field was no longer large enough to cover a whole word. There are two options to solve this problem: increasing the dilation or the network depth. Expanding the dilation can obviously ensure that the receptive field of one hidden unit reaches across the entire word vector. However, large dilations may cause some information loss and hinder future hardware parallelism, especially on small-capacity embedded hardware. Therefore, we chose to increase the depth of the network to improve its performance. Table 3 shows the performance of the model at various depths.

Optimization of Dilation
The dilation is a key parameter of the dilated convolution. It directly affects the accuracy of the network, but at the same time, it also determines the locality of reference of the network, which is crucial for the performance of hardware computations. The accelerator speeds up the computation by increasing the parallelism, which requires the memory to provide more than one input simultaneously. However, the Block Random Access Memory (RAM) on the FPGA chip can only provide two R/W ports at most. We have to divide the memory when the parallelism is greater than 2. EE-DGCNN used three dilations: 1, 2, and 5. Note that 5 is not a power of 2, which means that the memory (buffers) needs to be divided into 5n parts. However, the size of both the Block RAM and our data is divisible by 2. Therefore, using such dilations will inevitably waste memory resources to satisfy the data throughput rate. Otherwise, the pipeline must be stalled to reduce the requirements for data throughput rates, which further slows down the accelerator. Consequently, we modified the original dilations and retrained the model to keep the performance of the model. The experiments show that adjusting the dilations to 1, 2, and 4 can achieve better performance than 1, 2, and 5, and the results are shown in Table 4.

Simplification of Sigmoid
In DGCNN, the Sigmoid function is necessary for processing the output of the dilated convolution. The formula for the Sigmoid function is expressed as Equation (3).
The Sigmoid function contains exponential and division operations, which are expensive in hardware. We optimized the Sigmoid function from both the accelerator-implementation and the network-model perspectives and observed its effect on the performance of the model. First, we approximated the Sigmoid function using the Range Addressable Lookup Table method, so the Sigmoid result can be read directly from Block Random Access Memory (BRAM). Experiments showed that the output of the Sigmoid function requires only 8 bits for the hardware calculation. The specific strategy is as follows.
(1) When the input is greater than 4, the output result is 1.
(2) When the input is greater than or equal to −4 and less than or equal to 4, use the 2 integer digits of the input and 6 decimal digits as the address to get the result of the Sigmoid.
(3) When the input is less than −4, the output result is 0.

The approximated Sigmoid function is shown in Figure 4. On the other hand, we carefully studied the structure of the DGCNN network and concluded that the Sigmoid function merely provides a threshold and is not irreplaceable. Therefore, we replaced the Sigmoid function with the more hardware-friendly Hard Sigmoid function, calculated by Equation (4). Considering that the Hard Sigmoid still requires a multiplication and an addition, we further modified it by changing the factor 0.2 of x to 0.125 or 0.25, as shown in Figure 4. The advantage of these two factors is that the multiplication can be done with only a shift operation, which greatly reduces the chip area. We retrained the modified DGCNN, and the results confirmed our conjecture: the F1-score with the Hard Sigmoid function of factor 0.125 improved by about 1% over the standard Sigmoid, as shown in Table 5.
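Both approximations are easy to prototype in software. The snippet below assumes one plausible address mapping for the lookup table (2 integer and 6 fractional bits of magnitude plus a sign, i.e., one entry per 1/64 step over [−4, 4)); it is a sketch, not the exact hardware encoding:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Range-addressable LUT: one entry per 1/64 step over [-4, 4),
# i.e., 2 integer + 6 fractional bits of magnitude plus a sign bit.
LUT = sigmoid(np.arange(-4, 4, 1 / 64))

def sigmoid_lut(x):
    """Sigmoid approximation via a BRAM-style lookup table."""
    if x >= 4:
        return 1.0
    if x < -4:
        return 0.0
    return float(LUT[int((x + 4) * 64)])

def hard_sigmoid_shift(x, slope=0.125):
    """Hard Sigmoid with a power-of-two slope: the multiply becomes a shift."""
    return min(1.0, max(0.0, slope * x + 0.5))

print(sigmoid_lut(0.0), hard_sigmoid_shift(0.0))  # -> 0.5 0.5
```

With slope 0.125 = 2⁻³, the product 0.125 · x is a 3-bit right shift in fixed-point hardware, which is why this factor removes the DSP multiply entirely.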

Quantization
The DGCNN dramatically reduces computational complexity, but all data were represented as 32-bit floating-point numbers, which consumes considerable resources. Recent work [45] demonstrated that most neural networks do not require 32-bit precision for inference, and a 16-bit data width is usually sufficient to maintain accuracy. In this paper, we quantized floating-point numbers into fixed-point numbers and reduced the data bit width to the minimum that maintains accuracy. This achieves higher parallelism under the same area and power consumption while further reducing the pressure on bandwidth and memory resources.
The quantization strategy was determined by the data distribution and tested by software simulation. Figure 5 shows the value distribution histogram. We tested various combinations of data precision and finally determined that the weights and inputs of the network need only 8 bits to meet the precision requirements, while the intermediate results need 16 bits to keep the network precision. Table 6 shows the F1-score losses under quantization.
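A minimal sketch of such fixed-point quantization follows; the split of the 8 bits into 3 integer (including sign) and 5 fractional bits is an assumed example, not the paper's exact format:

```python
import numpy as np

def quantize(x, total_bits, frac_bits):
    """Round to a signed fixed-point grid with `frac_bits` fractional
    bits, saturating at the `total_bits` two's-complement range."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

w = np.array([0.7312, -0.0419, 1.25, -2.5])
print(quantize(w, 8, 5))  # 8-bit weights with 5 fractional bits
```

Values on the grid (such as 1.25 and −2.5) survive exactly, while the others are rounded to the nearest multiple of 2⁻⁵ = 0.03125; picking the integer/fraction split from the value histogram in Figure 5 keeps this rounding error small.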
We finally design an event detection model suitable for hardware implementation through the above design methods. The final network structure is shown in Table 7.

Overall Architecture
In most CNN accelerators, the feature maps between two layers significantly exceed the size of the on-chip memory in the FPGA, so intermediate results must be stored in off-chip memory, which takes considerable time. However, the feature maps between two dilated convolutional layers of our network are at most 2 kilobytes, and with a fully pipelined architecture we do not need to cache all intermediate results. Therefore, our accelerator accesses off-chip memory only to fetch inputs and return outputs.

Compute Unit Architectures
Mapping Unit. Different dilations impose different requirements on parallel loading. To meet the parallel computing requirement, we split and regroup the input data so that the computing unit can access all necessary inputs simultaneously. The mapping unit mainly consists of mapping logic and a variable number of line buffers. Suppose the input parallelism is p and the dilation rate is d; then the data needs to be split into 2 × (d − 1) + p banks. Figure 7 shows how the input data is partitioned when p is 4 and d is 2.

DGCNN Unit. The DGCNN unit is the most critical part of the accelerator. As shown in Figure 6, the DGCNN unit consists of the mapping unit, input buffer, MAC (Multiply ACcumulate) array, CONV (CONVolution) buffer, and gate MAC array. In particular, each weight of a dilated gated convolutional layer involves only three 8-bit numbers, which can easily be stored in on-chip memory. The input buffer and CONV buffer are composed of several line buffers, which cache the data in the pipeline. The MAC array contains 3 × p Digital Signal Processing (DSP) slices, where p is the parallelism of the input data. The gate MAC array consists of p DSP slices for the gate mechanism and addition logic for the residual structure. The computational flow of the DGCNN unit includes five stages.

Softmax Unit. Softmax is often the last activation function of a neural network, normalizing the output to a probability distribution over the predicted classes. It is defined as Equation (5). Obviously, Softmax is as unsuitable for hardware computation as Sigmoid. The difference is that the purpose of the Softmax function here is to find the most probable category, which corresponds to the maximum value of the inputs. Therefore, we skipped the complex mathematical calculations and used a pipelined comparator with 34 inputs to obtain the Softmax result.
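The Softmax-to-comparator substitution works because Softmax is monotonic, so the largest logit always wins. A small sanity check follows; the hardware comparator tree is modeled here as a simple sequential scan, and the logits are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # numerically stable softmax
    return e / e.sum()

def comparator_argmax(scores):
    """Find the winning class by comparisons only, as the hardware does:
    no exponentials, no divisions."""
    best = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best]:
            best = i
    return best

rng = np.random.default_rng(0)
logits = rng.standard_normal(34)  # 33 event types + NONE
assert int(np.argmax(softmax(logits))) == comparator_argmax(logits)
```

Since only the predicted class index leaves the accelerator, the probability values themselves are never needed, which is what justifies dropping the exponentials and the division entirely.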

Layer Fusion
Unlike common CNNs, each convolutional layer of our DGCNN model has the same computational complexity and its throughput rate is exactly the same. Therefore, this paper constructed a fine-grained pipeline by fusing all network layers to reduce computational latency.
Intralayer Pipeline. For a convolutional layer, the fastest way to start outputting results is to compute everything within the same convolutional window simultaneously. The ideal case is to complete all computations for a group of inputs at the same time, which in this paper means calculating 3072 multiplications and additions simultaneously. This not only requires a large number of multipliers but also needs high memory bandwidth, so computing in full parallel is extremely difficult in hardware. In our model, the dilated convolution is a 1-D convolution, and the data dependence between layers is unidirectional. Therefore, when computing in parallel within a layer, it is important to output results as quickly as possible. We fully unrolled the loop over the kernel and limited the unrolling across sliding windows to balance resource constraints and performance requirements.
Interlayer Pipeline. We designed a fine-grained interlayer pipeline to reduce the overall latency of the network. All layers of the network are fused into one layer. Take the 3-layer network shown in Figure 8 as an example. The computation of the second layer starts immediately when all the blue elements of the first layer are ready. Similarly, the computation of the third layer starts immediately after the second layer has computed the yellow elements. After the third layer finishes, the results are saved instantly to reduce unnecessary memory usage. In brief, the latencies of the different layers overlap, so the overall latency is much smaller than the sum of the latencies of all layers. Figure 9 compares the traditional pipeline with the layer-fused pipeline; the layer-fused pipeline clearly reduces the data processing latency.
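The latency benefit of layer fusion can be illustrated with a toy timing model; the cycle counts below are invented for illustration, not measured on the accelerator:

```python
def sequential_latency(startup, stream, n_layers):
    """Each layer runs to completion before the next one starts."""
    return n_layers * (startup + stream)

def fused_latency(startup, stream, n_layers):
    """Layer-fused pipeline: a layer starts as soon as its first input
    window is ready, so only the small startup offsets accumulate."""
    return n_layers * startup + stream

# Invented example: 12 layers, 8-cycle window fill, 128-cycle data stream
print(sequential_latency(8, 128, 12))  # -> 1632
print(fused_latency(8, 128, 12))       # -> 224
```

Because every DGCNN layer has the same throughput, the streaming portions of all layers overlap almost completely, leaving only the per-layer startup offsets on the critical path, which is the effect sketched in Figure 9.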

Experimental Setup
The ACE 2005 corpus is based on real-world radio, news, and web blogs, so it can reflect the performance of models in real-world scenarios. We used the same data split for training and testing as in previous studies [44]. Accuracy cannot properly measure the performance of a model under the uneven class distribution that is common in NLP tasks. Therefore, we used precision, recall, and F1-score to evaluate our event detection model. Precision is the ratio of correctly predicted positive labels to the total predicted positive labels. Recall is the ratio of correctly predicted positive labels to all true labels except NONE. Both precision and recall measure the model along only one dimension, while the F1-score takes both into account and provides a better measure of model performance. It can be calculated by Equation (6). A trigger is considered correct if its type and offset match the correct label. In most training runs, we set the learning rate to 1e−3, the maximum sequence length to 128, and the batch size to 4, and used Adam as the gradient descent optimizer.
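For concreteness, the standard definitions of precision, recall, and F1 (Equation (6) in its common form F1 = 2PR/(P + R)) can be computed as follows; the trigger counts are made-up examples:

```python
def precision_recall_f1(tp, fp, fn):
    """Micro-averaged metrics from true-positive, false-positive, and
    false-negative trigger counts (NONE predictions are excluded)."""
    p = tp / (tp + fp)            # precision
    r = tp / (tp + fn)            # recall
    f1 = 2 * p * r / (p + r)      # harmonic mean of the two
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # -> 0.889 0.8 0.842
```

Note how the harmonic mean pulls F1 toward the weaker of the two scores, which is why it is preferred over accuracy for the skewed class distribution of event triggers.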
The accelerator is based on the Xilinx XCKU115 FPGA chip. It is written in Verilog, and all syntheses are from Xilinx Vivado 2018.3. The general purpose processing platform is based on the Intel Core i7-8700k CPU and the NVIDIA GTX 1080 GPU. The code leverages Keras and calls CUDA (Compute Unified Device Architecture) for GPU.

Evaluation of the Model
We chose previous works also based on the ACE 2005 corpus as baselines for comparison. As shown in Table 8, our model significantly outperforms works published before 2019, and it also outperforms previous BERT-based works. Compared to the original EE-DGCNN model, our optimization strategy yields a nearly 12-percentage-point improvement in F1-score. Moreover, our model is more suitable for hardware acceleration than other works (e.g., Bi-LSTM [29] and NPN [23] (Nugget Proposal Networks)). The precision of the DGCNN-based model is notably higher than its recall. By analyzing the experiments and previous works, we identified two likely causes. First, the BERT-based word representation significantly improves precision with little effect on recall: Balali et al. [46] obtained a 6.7-percentage-point increase in precision but only a 0.69-percentage-point increase in recall when using a BERT-based word representation instead of a GloVe-based one with the same model. Second, the DGCNN is a character-wise model. In contrast to word-wise models, character-wise models lack information about the positions of characters, and our evaluation counts a word as correctly classified only if all characters inside it are classified correctly. This also affects the precision and recall [29].

Evaluation of the Accelerator
The resource utilization of our implementation is reported in Table 9. Our DGCNN accelerator is very small: all resource utilization percentages are below 15%. This means the proposed hardware architecture can be implemented on various resource-limited platforms (e.g., embedded platforms).
We made comparisons with general purpose processing platforms. As shown in Table 10, we compared against the Intel Core i7-8700k and the NVIDIA GTX 1080. Note that the power of the CPU is its thermal design power, taken from [47], while the GPU power was reported by the nvidia-smi program and the FPGA power by Xilinx Vivado 2018.3. Our accelerator is significantly superior to both CPU and GPU in throughput and energy efficiency: 17× and 158× higher than the GPU, respectively. We analyzed the reasons for the low performance of the CPU and GPU. A likely cause is that the deep learning framework was not yet optimized for the DGCNN structure, resulting in inefficient computation. This is supported by the fact that the GPU, with a thermal design power of 180 W [48], consumed only 66 W at full utilization.

Conclusions
In this paper, we analyzed the event detection task and proposed a hardware-friendly Chinese event detection model based on EE-DGCNN. We optimized the model by adjusting the dilations and replacing the Sigmoid function to make it more suitable for hardware implementation. Additionally, we quantized the parameters and activations to further reduce hardware complexity and resource utilization. The model achieved the best F1-score on the Chinese ACE 2005 corpus. We further proposed an accelerator architecture and implemented it on a Xilinx XCKU115 FPGA. Our accelerator adopts a fully pipelined architecture, which significantly reduces latency by combining interlayer and intralayer pipelines. Our experiments show that the accelerator achieved 95.2 GOP/s and 13.4 GOPS/W in performance and energy efficiency, which are 17 and 158 times higher than the GPU, respectively. To our knowledge, we are the first to propose an accelerator for natural language processing tasks. Future work should explore combining event detection with event argument extraction and should extend the benchmark results to other datasets (e.g., KBPEval2017).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: