TDJEE: A Document-Level Joint Model for Financial Event Extraction

Abstract: Extracting financial events from the large volume of financial announcements is important for helping investors make sound decisions. However, it remains challenging because event arguments are often scattered across multiple sentences in a financial announcement, while most existing event extraction models only work at the sentence level. To address this problem, this paper proposes a relation-aware Transformer-based Document-level Joint Event Extraction model (TDJEE), which encodes relations between words into the context and leverages a modified Transformer to capture document-level information for filling event arguments. Meanwhile, the absence of labeled data in the financial domain can make models unstable in their extraction results, which is known as the cold start problem. To mitigate it, a Fonduer-based knowledge base combined with the distant supervision method is proposed to simplify event labeling and provide a high-quality labeled corpus for model training and evaluation. Experimental results on real-world Chinese financial announcements show that, compared with other models, TDJEE achieves competitive results and can effectively extract event arguments across multiple sentences.


Introduction
Event extraction [1], which aims to identify event arguments (the primary roles composing an event) and fill them into the corresponding pre-defined event types, is a challenging task in natural language processing (NLP). Naturally, event extraction can be divided into two sub-tasks: event argument extraction and event type detection. It has a wide range of applications in fields such as intelligent question answering, information retrieval [2], automatic summarization [3], and recommendation [4].
In recent years, with the growing number of financial announcements, analyzing them manually has become a labor-intensive task. Therefore, automatically extracting structured events from financial announcements is critical for investors' decision making. However, it is still a challenge that event arguments are scattered across different sentences in the financial domain. Figure 1 illustrates this challenge with an example: the event roles Equity Holder and Pledgee are in Sentence 1, Start Date and End Date are in Sentence 3, and Pledged Shares is in Sentence 5. However, most event extraction models [5][6][7] extract arguments within a single sentence, as is the case for models built on ACE 2005 (https://www.ldc.upenn.edu/collaborations/past-projects/ace, 15 January 2021), a popular event extraction dataset. Such models lack the ability to extract event arguments across multiple sentences. Researchers have therefore turned to the document level to tackle the argument-scattering problem. Huang et al. [8] point out that a pipeline architecture with a three-stage task can extract document-level context information, but error propagation is an obstacle to generating correct results. Zheng et al. [9] use an end-to-end model that first identifies event arguments at the sentence level. Then, binary classifiers are used to determine the event type, and event arguments are transformed into a directed acyclic graph. However, it is still hard to handle event arguments that are far apart. In this paper, we propose a relation-aware Transformer-based [10] Document-level Joint Event Extraction model (TDJEE) to address the challenge of event arguments spanning multiple sentences. TDJEE includes two sub-models: event argument extraction and event type detection.
In event argument extraction, word representations are obtained from BERT [11], and each word representation contains context information from other sentences. Then, the conditional random field (CRF) method [12] is used to identify the event arguments. Event type detection is treated as a classification task, which aims to identify the core event sentence in a document and detect the corresponding event type. In event type detection, we utilize an attention mechanism [13] to integrate word representations into sentence representations. A relation-aware Transformer is then used to further capture document-level information.
Moreover, the absence of adequate labeled data in the financial domain can lead to unstable, low-quality models, which is known as the cold start problem. To train and evaluate our model, we first employ Fonduer [14] to automatically build a domain-specific knowledge base, which stores structured Chinese financial events, and then generate training data by distant supervision labeling. A matcher and a filter are designed to remove noisy information. With this weak supervision method, we obtain a finance-specific Chinese dataset that is about 60 times larger than ACE 2005.
Experimental results on real-world datasets demonstrate that TDJEE achieves competitive results and can effectively extract event arguments across different sentences in financial announcements. The implementation code and datasets used in this paper are available at https://github.com/q5s2c1/TDJEE/tree/master (20 March 2021).
The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3, we first outline the TDJEE model and give some preliminaries, and then detail the construction of the Fonduer-based knowledge base, distant supervision-based event labeling, document encoding, event argument extraction, event type detection, and the optimization techniques. The experimental results and discussion are in Section 4. Finally, Section 5 presents conclusions and future work.

Related Work
There are two types of event extraction methods: pipeline methods and joint models. Pipeline methods divide the event extraction task into multiple sub-tasks and process them in sequence. Their disadvantage is that errors can propagate from one sub-task to the next. To solve this problem, joint models have been proposed for event extraction, which usually involve joint inference or joint modeling. Joint inference uses ensemble learning to optimize the models through an overall objective function. Joint modeling regards the event structure as a dependency tree and converts the extraction task into dependency tree structure prediction, so that it can recognize trigger words and extract elements at the same time. Because inference and modeling share hidden layer parameters, the performance loss caused by error propagation can be avoided.
From the technique perspective, event extraction methods can also be divided into template- and rule-based methods, deep learning-based methods, and weak supervision-based methods. Early event extraction methods are based on template matching or regular expressions [15]. Researchers used syntactic analysis and semantic constraints to identify events in sentences. PALKA [16] uses semantic frames and phrase patterns to represent the extraction schema of events. By incorporating the semantic information of WordNet [17], PALKA can achieve results close to human performance in specific domains. However, the performance of template-based methods depends heavily on the language, and their portability is poor.
In recent years, most event extraction methods have been based on deep learning. Compared with traditional template-based and rule-based methods, deep learning methods are more general. Chen et al. [6] propose a Dynamic Multi-Pooling Convolutional Neural Network (DMCNN) to extract events, which regards event extraction as a two-stage multi-classification task: trigger classification and argument classification. However, this method suffers from error propagation and ignores the dependency between the trigger word and the arguments. The DEEB-RNN model proposed by Zhao et al. [18] makes full use of document information in event extraction: the text is encoded by an RNN-based fusion hierarchy and an attention mechanism, and the resulting text representation is used to determine event trigger words and types in sentences. However, as sentences become longer, the RNN-based method no longer works well. Han et al. [19] propose a network model based on the dilated gated convolutional neural network, in which the word representations and the depth of the network are expanded to improve performance. Peng et al. [20] combine a CNN with a gated linear mechanism to accelerate the encoding of text.
In particular, deep learning models rely heavily on a large-scale training corpus, and weak supervision is used to reduce the effort of event labeling. Chen et al. [21] use a high-quality labeled corpus to train a classifier. Iteratively, the trained classifier is used to label the unlabeled data, and high-confidence samples are selected to retrain the classifier. Finally, a high-quality labeled dataset is obtained for event extraction. Liao et al. [22] first use self-training [23] and semi-supervised learning to extend the labeled corpus. Then, classifiers at word and sentence granularity are trained simultaneously, and co-training [24] is utilized to extend the labeled data for event extraction.
In financial event extraction, Li et al. [25] propose a method for extracting Chinese financial events by automatically constructing the extraction rules, but this method ignores the background information about entities and relationships. Ding et al. [26] propose a joint event embedding model, KGEB, which embeds the knowledge graph into the event vector representation. KGEB uses knowledge bases (such as Freebase and YAGO) to provide two types of background knowledge for event embedding, classification knowledge and relational knowledge, and has achieved promising results in the task of stock price fluctuation prediction. Dor et al. [27] focus on mining company-related events from text, including Wikipedia articles that describe companies. Their labeled training corpus is generated through weak supervision, and a classification model is used to detect company-related event sentences. For reports in the financial domain, Ding et al. [28] first define a basic event framework based on expert knowledge, then combine it with traditional natural language processing tools such as rules, part-of-speech tagging, and named entity recognition for trigger word recognition and event element recognition, and finally transform the events in the financial reports into a structured form.
DCFEE, proposed by Yang et al. [29], extracts equity freezing, pledge, repurchase, and increase or decrease of holdings from financial announcements. The DCFEE model expands the training corpus by using distant supervision and divides event extraction into two-stage sub-tasks. In the first stage, the sentence-level event extraction task is performed. Bi-LSTM (Bidirectional Long Short-Term Memory) [30] and CRF (Conditional Random Fields) [12] are combined to label the event elements in a sentence, and the event trigger word is detected through a dictionary. In the second stage, the identified event elements and event trigger word are concatenated with the current sentence as input, and a CNN is used to determine whether the current sentence is an event sentence. Doc2EDAG, proposed by Zheng et al. [9], transforms the event into an entity-based directed acyclic graph (EDAG), which turns the hard slot-filling task into several sequential path-expanding sub-tasks. Moreover, Doc2EDAG designs a memory mechanism for path expanding to support EDAG generation efficiently.

Model
In this section, we first provide an overview of our model, TDJEE. Then, the construction of the Fonduer-based knowledge base is presented, followed by the improved event labeling method based on distant supervision. Subsequently, the components of TDJEE, including document encoding, event argument extraction, and event type detection, are described in detail. Finally, the loss function of TDJEE is discussed.

TDJEE Overview
The main idea of TDJEE is to incorporate richer context information into sentence representations. Figure 2 shows an overview of the TDJEE model. TDJEE first performs the three-layer document encoding, using BERT to generate token embeddings for financial documents. Event extraction then consists of two parts: event argument extraction and event type detection. Event argument extraction is a sequence labeling task. In the event argument extraction model, the document is first encoded by BERT, which provides a semantic vector representation of the document. Then, the high-dimensional document representation is mapped to a lower dimension through a feedforward neural network, and finally the event argument entities are identified by the CRF layer. Event type detection can be considered a typical classification task. In event type detection, a relation-aware Transformer-based encoder is used to capture context information across multiple sentences. Furthermore, the results of event argument extraction are integrated into the sentence representations, which are used in the final classification.

Fonduer-Based Knowledge Base Construction
Fonduer is a machine learning-based knowledge base construction (KBC) system for richly formatted data, which uses a deep learning model to automatically capture the representation (i.e., features) [14]. Figure 3 presents the framework for constructing the event knowledge base. The construction is divided into three phases: (1) data preprocessing; (2) matching and filtering the candidate event set with pre-defined rule-based matchers; and (3) generating multimodal features for the candidate event set and training the weakly supervised classification model.

Data Preprocessing
This paper mainly studies five types of financial announcement events. Each event type uses a separate database to store the extracted structured events, and the structure of the database table corresponds to the event ontology of that type.
During data preprocessing, financial announcements in PDF format are converted into HTML format by using pdfminer (https://github.com/pdfminer/pdfminer.six, 15 January 2021), and all announcements are input into Fonduer. The PDF data provide visual information, and the HTML data provide text and structural information. The Fonduer system first parses the HTML data to identify paragraphs, sentences, tables, etc., and then converts the parsed content into the data model defined in Fonduer. The data model in Fonduer is a directed acyclic graph (DAG), which clearly describes the hierarchical structure between contexts in the document. Each node in the graph represents a context, and the root node of a DAG is the corresponding document. A document can be divided into chapters, which include text, tables, and pictures. The text is divided into paragraphs, and each paragraph is composed of sentences. For tabular data, row, column, and header attributes can be segmented. Note that the announcement data do not contain images. The advantage of using the Fonduer data model is that event arguments can be extracted from different contexts in the document and combined into a candidate event.

Candidate Event Set Generation
To generate the candidate event set, matchers and filters are designed first. PersonMatcher, DateMatcher, OrganizationMatcher, and NumberMatcher are used to match named entities appearing in sentences, namely person names, dates, organization names, and numbers, respectively. These named entity recognizers are provided by spaCy (https://github.com/howl-anderson/Chinese_models_for_SpaCy, 26 March 2021). DictionaryMatcher is used to match event arguments from a pre-defined dictionary. RegexMatchSpan matches qualified event arguments in sentences based on regular expressions. LambdaFunctionMatcher contains user-defined functions that filter the input N-gram characters. In this way, multiple event arguments can be obtained.
Because each combination of different entities forms a new candidate event, the number of candidate events can grow exponentially, and the generated candidate events contain a large number of negative examples. The filter is essentially a set of rules that the user mines from a large number of announcements. For any type of event, a set of such rules is used to filter the candidate events.
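The growth of the candidate set and the effect of a filter rule can be illustrated with a minimal sketch. It does not use the Fonduer API; the role names, the matched spans, and the rule in `keep` are hypothetical and stand in for the matcher outputs and the user-mined rules described above.

```python
from itertools import product

# Hypothetical matcher outputs for one announcement: role -> matched spans.
matched = {
    "EquityHolder": ["Zhang San", "Li Si"],
    "Pledgee": ["Bank A", "Bank B"],
    "StartDate": ["2018-01-05", "2018-03-01"],
    "EndDate": ["2018-02-01", "2019-03-01"],
}

def generate_candidates(matched):
    """Every combination of one span per role forms one candidate event."""
    roles = list(matched)
    spans = (matched[role] for role in roles)
    return [dict(zip(roles, combo)) for combo in product(*spans)]

def keep(candidate):
    """Illustrative rule (an assumption): the start date precedes the end date.

    ISO-formatted dates compare correctly as strings.
    """
    return candidate["StartDate"] < candidate["EndDate"]

candidates = generate_candidates(matched)
filtered = [c for c in candidates if keep(c)]
```

With two spans for each of four roles, the cross product already yields 2^4 = 16 candidates, which is why rule-based filtering is applied before classification.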

Multimodal Feature and Weak Supervised Classification
To improve the performance of the classification model, Fonduer generates multimodal features for each candidate event, including text features, structural features, table features, and vision features. The text feature is the concatenation of the part of speech and the named entity tag corresponding to the event argument. The structural feature represents the location of the event argument in the original HTML document, the corresponding tags, attributes, and other information. Table features represent information such as the position and attribute distance of event arguments in the tables of financial announcements. Vision features are mainly used to indicate whether different event arguments have similar visual features. The multimodal features of an event are the concatenation of the above features and are used for the weakly supervised classification task.
As shown in Figure 4, in weakly supervised classification, the user first provides weakly supervised labels for the data through labeling functions. The classification model then learns potential relationships from the manually labeled data, test data are used to verify the performance of the model, and finally the user adjusts the labeling functions by observing the performance of the model. This process is iterative, and a classification model with better performance is eventually trained. After using this model to label all candidate events, a structured event knowledge base is obtained. Here, the labeling function follows the data programming paradigm: it labels candidate events programmatically, that is, it marks a candidate event as a positive or negative example or skips labeling.
Table 1 shows the event ontology for five types of announcements: Equity Repurchase (ER), Equity Freeze (EF), Equity Underweight (EU), Equity Overweight (EO), and Equity Pledge (EP). The event ontology defines the event roles needed to form a structured event, which are divided into key roles and non-key roles. A key role is information necessary to form an event, such as the Company Name and Repurchased Shares in ER. If any key role is missing, a complete structured event cannot be formed.
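The labeling-function idea can be sketched as follows. The three rules and the candidate fields (`company`, `shares`, `same_sentence`) are hypothetical, and a simple majority vote is used here in place of the learned model that combines labeling-function outputs in a data programming system.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_has_key_roles(c):
    # Hypothetical rule: a candidate missing a key role cannot be a true event.
    return ABSTAIN if c.get("company") and c.get("shares") else NEGATIVE

def lf_same_sentence(c):
    # Hypothetical rule: arguments found in one sentence are likely related.
    return POSITIVE if c.get("same_sentence") else ABSTAIN

def lf_shares_positive(c):
    # Hypothetical rule: the number of shares must be positive.
    return POSITIVE if (c.get("shares") or 0) > 0 else NEGATIVE

LABELING_FUNCTIONS = [lf_has_key_roles, lf_same_sentence, lf_shares_positive]

def weak_label(candidate):
    """Combine votes by simple majority; ties and all-abstain return ABSTAIN."""
    total = sum(lf(candidate) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN
```

Adjusting the rules after inspecting model performance corresponds to the iterative loop in Figure 4.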

Distant Supervision Based Event Labeling
Motivated by the knowledge base-driven distant supervision method for data labeling [31], we first construct a Fonduer-based financial event knowledge base and then align the knowledge base with the documents (i.e., announcements) through distant supervision to obtain a large amount of labeled data.
Using the distant supervision method to align a knowledge base with unstructured text can automatically build a large amount of training data and reduce the dependence on manually annotated data. The disadvantage of this approach is that its assumption is too optimistic, so a lot of noise is introduced. In event extraction, an event contains multiple event arguments, and the event arguments may be scattered across multiple sentences. Therefore, the traditional distant supervision method is not suitable for labeling event arguments. In this paper, we improve the alignment of the knowledge base and unstructured text in the distant supervision method and use the directional link method to match the knowledge in the knowledge base with the text, as shown in Figure 5. Specifically, we use a PDF parsing tool to convert the announcement document from PDF format into text format. Then, we clean the text data to standardize the converted text, which mainly includes filtering garbled characters and segmenting sentences according to special symbols. Next, we retrieve the structured events of the document from the knowledge base according to the document ID and tag the event arguments in BIO (Beginning, Inside, Outside) format. The annotation results are filtered by two conditions: (1) an event must contain a key role (e.g., person or company) defined by the event ontology, and (2) the document and the event ontology must have the same number of event roles. Finally, we obtain a labeled corpus for five types of events: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO), and Equity Pledge (EP).
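The alignment and filtering steps can be sketched as below. This is a minimal character-level illustration, not the released implementation: the role names are hypothetical, exact string matching stands in for the directional link method, and `is_valid` is one possible reading of the two filter conditions (all key roles found, and every role of the knowledge-base event found in the text).

```python
def bio_tag(text, event):
    """Tag each character of `text` with BIO labels for the arguments in `event`.

    `event` maps a role name to the argument string stored in the knowledge base.
    """
    tags = ["O"] * len(text)
    for role, value in event.items():
        if not value:
            continue
        start = text.find(value)
        while start != -1:
            end = start + len(value)
            # Do not overwrite characters already claimed by another role.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = f"B-{role}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{role}"
            start = text.find(value, end)
    return tags

def is_valid(tags, event, key_roles):
    """Keep the annotation only if key roles and all event roles were found."""
    found = {t[2:] for t in tags if t.startswith("B-")}
    return key_roles <= found and found == set(event)
```

For example, tagging "Li Si pledged 5000 shares to Bank A." with the event {Holder: "Li Si", Pledgee: "Bank A", Shares: "5000"} marks the first character as B-Holder and the following four as I-Holder.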

Document Encoding
Recently, pre-trained language models have achieved good results on many natural language processing tasks. In this paper, we utilize BERT to embed the document in order to capture context information beyond the sentence boundary. Before encoding the document with BERT, we first segment the document $D$ into several sequences $\{S_1, S_2, S_3, \ldots, S_q\}$, where $q$ is the number of sentences. The length of each sequence is less than 128. However, directly truncating sentences longer than 128 would leave event arguments incomplete at prediction time. Instead, we select the punctuation mark with the largest index below 128 as the cutting position of the sentence. Because the number of sentences differs between documents, and in order to use BERT to compute the vector representation of each character in batches, we limit the maximum number of sentences allowed in a document and fix the input of each document to a two-dimensional matrix of 64 × 128. That is, the number of sentences does not exceed 64, and the length of each sentence does not exceed 128. If the number of sentences exceeds the limit, the document is truncated directly. Otherwise, if the number of sentences is less than the limit, the blank rows are filled with <PAD> characters, and a mask indicates the actual sentence lengths and the actual number of sentences. Finally, the document can be represented as in Equation (1), where $E_{W_i} \in \mathbb{R}^{768}$ is the representation of the $i$-th word and $l$ is the number of words in the document.
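The segmentation and padding scheme can be sketched as follows. This is an illustration under assumptions: tokens are single characters, the punctuation set `PUNCT` is hypothetical, and a sentence with no punctuation in its first 128 characters is cut at position 128.

```python
MAX_SENTS, MAX_LEN, PAD = 64, 128, "<PAD>"
PUNCT = set("。；，！？;,!?")  # assumed set of cutting punctuation marks

def split_long(sentence, max_len=MAX_LEN):
    """Cut a sentence at the last punctuation mark whose index is below max_len."""
    pieces = []
    while len(sentence) > max_len:
        cut = max(
            (i for i in range(max_len) if sentence[i] in PUNCT),
            default=max_len - 1,
        )
        pieces.append(sentence[: cut + 1])
        sentence = sentence[cut + 1 :]
    if sentence:
        pieces.append(sentence)
    return pieces

def to_matrix(sentences):
    """Return a MAX_SENTS x MAX_LEN token grid and a mask of real positions."""
    rows = [list(p) for s in sentences for p in split_long(s)][:MAX_SENTS]
    rows += [[] for _ in range(MAX_SENTS - len(rows))]
    grid = [r + [PAD] * (MAX_LEN - len(r)) for r in rows]
    mask = [[1] * len(r) + [0] * (MAX_LEN - len(r)) for r in rows]
    return grid, mask
```

The mask rows that sum to zero correspond to padded sentences, so both the actual sentence lengths and the actual number of sentences can be recovered from it.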

Event Argument Extraction
Event argument extraction can be treated as a sequence labeling task. Generally, B-LABEL marks the beginning of an entity of type LABEL, and I-LABEL marks its middle and tail parts. However, when BERT alone is used for sequence labeling, the labels with the maximum scores may form an invalid sequence. For example, I-LABEL1 may be followed directly by I-LABEL2, or a label sequence may begin with an I- tag. To address such issues, we introduce a CRF into our model to improve the performance of the sequence labeling task (i.e., event argument extraction).
In this stage, BERT first provides a representation for each word. Then, the model learns the state features of the sequence through a fully connected neural network to obtain a state score matrix $C \in \mathbb{R}^{m \times n}$ according to Equation (2), where $n$ is the number of words in a sentence (i.e., 128). Next, the scores are input into the CRF layer. Finally, the CRF layer learns a transition score matrix $T \in \mathbb{R}^{m \times m}$ and computes the scores of all paths (i.e., possible tag sequences) as in Equations (3) and (4), where $m$ is the number of label types, $n$ is the length of the sentence (in this case, 128), $C_{i,y_i}$ is the score of the tag $y_i$ of the $i$-th token in the sequence, and $T_{y_i,y_{i+1}}$ is the score of a transition from tag $y_i$ to tag $y_{i+1}$. A softmax function is applied over the scores of all paths to obtain the probabilities $\{P_1, P_2, P_3, \ldots, P_N\}$, where $N$ is the number of paths.
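The path scoring described above can be sketched in a few lines. The sketch assumes the usual linear-chain form, in which a path score is the sum of the state scores $C_{i,y_i}$ and the transition scores $T_{y_i,y_{i+1}}$; here `C` is indexed as position by tag, and all paths are enumerated by brute force, which is only feasible for tiny inputs (practical CRF layers use dynamic programming instead).

```python
import math
from itertools import product

def path_score(C, T, path):
    """State scores C[i][y_i] plus transition scores T[y_i][y_{i+1}]."""
    emit = sum(C[i][y] for i, y in enumerate(path))
    trans = sum(T[a][b] for a, b in zip(path, path[1:]))
    return emit + trans

def path_probabilities(C, T):
    """Softmax over the scores of all m**n tag paths (brute force)."""
    n, m = len(C), len(T)
    paths = list(product(range(m), repeat=n))
    scores = [path_score(C, T, p) for p in paths]
    peak = max(scores)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return {p: math.exp(s - log_z) for p, s in zip(paths, scores)}
```

With tags ordered as (O, B, I), a strongly negative entry `T[0][2]` makes any path containing O followed by I unlikely, which is how the transition matrix suppresses invalid label sequences.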

Event Type Detection
Event type detection is performed at the sentence level. Because different event arguments of the same event type may be scattered across multiple sentences, we detect the event type of sentences with a Transformer-based encoder that further captures context information between different sentences. However, a document may contain many sentences, and it may be hard for the Transformer to capture such complex inner dependencies. Inspired by [32], we first modify the self-attention of the Transformer as in Equations (5)-(7), where the $r_{ij}$ terms encode the known relation between two input elements $x_i$ and $x_j$.
Other parameters are the same as in self-attention [10]. Specifically, $W_i$ is the output of self-attention. We then define two undirected edges (relations): $r_{same}$ and $r_{different}$. If the event roles in $S_i$ and $S_j$ can appear in the same event, $S_i$ and $S_j$ are connected by an edge $r_{same}$, as shown in Figure 6. Otherwise, they are connected by an edge $r_{different}$. Both $r_{same}$ and $r_{different}$ are learned during training. In this way, some inner relations are exposed and encoded into the context information. In this paper, we call the modified Transformer above the relation-aware Transformer.
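A single-head version of relation-aware self-attention can be sketched as follows. The sketch follows the common formulation in which a relation embedding is added to the key and to the value of each pair; this form, the omission of the query, key, and value projection matrices, and the example embeddings are all assumptions made for brevity, not the exact Equations (5)-(7).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def relation_aware_attention(X, relation, rel_k, rel_v):
    """X: list of sentence vectors; relation[i][j]: 'same' or 'different'.

    rel_k / rel_v map a relation name to the embedding added to the key /
    value of sentence j when it is attended to from sentence i.
    """
    d = len(X[0])
    out = []
    for i, query in enumerate(X):
        logits = [
            dot(query, add(X[j], rel_k[relation[i][j]])) / math.sqrt(d)
            for j in range(len(X))
        ]
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        z = [0.0] * d
        for j, w in enumerate(weights):
            value = add(X[j], rel_v[relation[i][j]])
            z = [zi + (w / total) * vi for zi, vi in zip(z, value)]
        out.append(z)
    return out
```

Because the relation embeddings enter both the attention logits and the aggregated values, sentences linked by `same` can be weighted up and can contribute a shared signal, while sentences linked by `different` are weighted down.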

Figure 6. An example of relation-aware edges ($r_{same}$, $r_{different}$). Start Date and Pledgee can appear in an event like EO, but Start Date in S1 and Start Date in S2 cannot appear in the same event.
In event type detection, we first use the attention mechanism to integrate word representations into sentence representations. The document can be represented by a set of sentence representations $\{E^{att}_{S_1}, E^{att}_{S_2}, E^{att}_{S_3}, \ldots, E^{att}_{S_q}\}$, where $q$ is the number of sentences. The representations refined by the relation-aware Transformer are denoted as $\{\hat{E}_{S_1}, \hat{E}_{S_2}, \hat{E}_{S_3}, \ldots, \hat{E}_{S_q}\}$. In the attention network, the query vector is the hidden state of a word, and $key_i$ and $value_i$ correspond to the word representation.
Intuitively, event arguments are the primary elements of events and may provide important information for event type detection. Therefore, we combine the results of event argument extraction $\{P_{S_i-1}, P_{S_i-2}, \ldots, P_{S_i-n}\}$ with the refined representation $\hat{E}_{S_i}$. Each vector $P_{S_i-k}$ corresponds to an event role in the sentence and is learned during training. Finally, we treat event type detection as a binary classification task and define five discriminant models for the five event types (i.e., Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO), and Equity Pledge (EP)). Each discriminant model is a simple feedforward neural network. The probability that a sentence belongs to an event type is calculated by Equation (8), where $W$ is a trainable parameter and $\sigma$ is the sigmoid function. The event type of a document is the type that receives the most votes.
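The sentence-level classification and document-level voting can be sketched as below. The raw scores are taken as given (they would come from the five discriminant models), and the decision threshold of 0.5 is an assumption.

```python
import math
from collections import Counter

EVENT_TYPES = ["EF", "ER", "EU", "EO", "EP"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sentence_votes(logits, threshold=0.5):
    """logits: event type -> raw classifier score for one sentence."""
    return [t for t in EVENT_TYPES if sigmoid(logits[t]) > threshold]

def document_event_type(per_sentence_logits, threshold=0.5):
    """Return the event type with the most sentence-level votes, or None."""
    votes = Counter(
        t
        for logits in per_sentence_logits
        for t in sentence_votes(logits, threshold)
    )
    return votes.most_common(1)[0][0] if votes else None
```

A sentence may vote for several types or for none, since each type has its own binary classifier.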

Optimization
Event type detection and event argument extraction share the same document encoding layer. Event argument extraction calculates the loss of the CRF layer sentence by sentence, as shown in Equation (9), where $P_{real}$ represents the score of the ground truth path. The event argument extraction loss of a document is the sum of the losses of all its sentences. The event type detection model uses cross entropy, denoted as $loss_e$, to calculate the classification loss for each event type. Equation (10) gives the overall loss function of the event extraction model, where $q$ is the number of valid sentences in the document, $k$ is the number of event types, and $\lambda$ is a hyperparameter that adjusts the ratio between the losses of the event argument extraction and event type detection tasks.
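One way to combine the two losses can be sketched as follows. The sketch assumes that the CRF loss is the negative log-likelihood of the ground truth path and that $\lambda$ weights the argument extraction loss while $1 - \lambda$ weights the type detection loss; both are assumptions about the form of Equations (9) and (10).

```python
import math

def crf_loss(path_scores, real_index):
    """Negative log-likelihood of the gold path given the scores of all paths."""
    peak = max(path_scores)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in path_scores))
    return log_z - path_scores[real_index]

def bce(p, y):
    """Binary cross entropy for one event-type classifier output."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(crf_losses, type_probs, type_labels, lam=0.2):
    """Weighted sum over q valid sentences and k event types (assumed form).

    crf_losses: one CRF loss per sentence.
    type_probs / type_labels: per sentence, one probability / 0-1 label
    for each event type.
    """
    arg_loss = sum(crf_losses)
    type_loss = sum(
        bce(p, y)
        for probs, labels in zip(type_probs, type_labels)
        for p, y in zip(probs, labels)
    )
    return lam * arg_loss + (1 - lam) * type_loss
```

Under this form, a small `lam` reduces the influence of the accumulated per-sentence CRF losses relative to the classification loss.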

Experiments
In this section, we present the experimental results of TDJEE. We first describe implementation details, including the quality of the knowledge base and the hyperparameter settings. Then, we show experimental results including the baselines, knowledge base construction, the main results of our method, and ablation performance.

Datasets
The experimental data are company announcements from 2008 to 2018 crawled from the Chinese financial portal East Money (http://data.eastmoney.com/notices/hsa.html, 10 January 2020). Table 2 shows the number of announcement documents crawled for constructing the knowledge base and the statistics of the dataset used in this paper. We obtain 31,748 documents of five event types in total, of which 5900 are manually labeled to verify the quality of the dataset and 25,848 are used for distant supervision labeling. Then, the dataset is randomly shuffled and divided into training, testing, and validation sets at a ratio of 8:1:1 for evaluating the TDJEE model.
In our implementation, we set the maximum number of sentences and the maximum sentence length to 64 and 128, respectively. Furthermore, the dimensions of the hidden layer and the output layer of the feedforward network in the event argument extraction sub-task are set to 1024 and 43, respectively. During training, we set $\lambda = 0.2$. We employ the Adam [33] optimizer with a learning rate of $10^{-4}$ and train for at most 30 epochs. The batch size is set to 30 and the dropout rate is set to 0.1. More detailed hyperparameter settings can be found in Table 3.
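The random 8:1:1 split can be sketched as below. The fixed seed and the way remainders are assigned (to the last subset) are assumptions made so that the example is reproducible.

```python
import random

def split_dataset(documents, ratios=(8, 1, 1), seed=0):
    """Shuffle and split into training, testing, and validation sets."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    total = sum(ratios)
    n_train = len(docs) * ratios[0] // total
    n_test = len(docs) * ratios[1] // total
    train = docs[:n_train]
    test = docs[n_train : n_train + n_test]
    valid = docs[n_train + n_test :]
    return train, test, valid
```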

Baseline
We use the TIER [8], DCFEE [29], and Doc2EDAG [9] models as baselines. TIER employs a pipeline architecture with a three-stage task to obtain document-level context information: classifying narrative documents, recognizing event sentences, and analyzing noun phrases. DCFEE divides event extraction into two stages. The first stage is a sentence-level event extraction task, which uses a bidirectional LSTM [30] and a CRF to mark the event arguments in the sentence; event trigger words are detected through a dictionary. In the second stage, a convolutional neural network (CNN) is used to identify the event sentence and complete the arguments. Doc2EDAG uses an end-to-end method to first identify event arguments at the sentence level. Then, binary classifiers are used to determine the event type, and event arguments are transformed into a directed acyclic graph.

Performance of Knowledge Base Construction
We use distant supervision to label the structured event data and obtain the training corpus. Therefore, the quality of the knowledge base influences the quality of the labeled data. Because Fonduer uses weak supervision to label data, the quality of the labeling functions cannot be verified directly. Instead, the quality of the dataset is verified indirectly by testing the performance of the classification model in Fonduer.
First, we study the threshold of the classification model. Figure 7 shows the influence of the classification threshold. It can be seen that (1) with the same threshold of 0.65, three types of events (ER, EF, and EO) achieve their highest F1 scores, and (2) EP has the worst performance under this threshold. Therefore, we use different thresholds for different types of events to maximize the performance of all models. After obtaining the best classification model, we further perform an experiment on the five types of events to verify the quality of event labeling. As shown in Table 4, event labeling on ER achieves an F1 score of 83.3%. The F1 scores on the other three event types are slightly lower than that on ER, which may be due to more noise introduced by the labeling functions. In addition, simply increasing the number of labeling functions cannot improve the quality of event labeling.
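Selecting a separate threshold for each event type can be sketched as a simple sweep that maximizes F1 on held-out data. The threshold grid (0.05 to 0.95 in steps of 0.05) is an assumption.

```python
def f1_at(scores, labels, threshold):
    """F1 score when candidates with score >= threshold are predicted positive."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p and y)
    fp = sum(1 for p, y in zip(pred, labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, grid=None):
    """Return the grid threshold with the highest F1 (first one on ties)."""
    grid = grid or [round(0.05 * i, 2) for i in range(1, 20)]
    return max(grid, key=lambda t: f1_at(scores, labels, t))

def per_type_thresholds(data):
    """data: event type -> (scores, labels). Returns event type -> threshold."""
    return {etype: best_threshold(s, y) for etype, (s, y) in data.items()}
```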

Main Results
Besides using the pre-trained model BERT, we also replace BERT with Bi-LSTM as the embedding layer in TDJEE to verify the advantage of our method. As Table 5 shows, TDJEE performs better than the baseline models for almost all types of events. Specifically, compared with DCFEE, TDJEE improves the F1 score by 9.53%, 18.79%, 13.24%, 18.17%, and 22.49% on ER, EF, EP, EO, and EU, respectively.
In our dataset, event arguments are scattered across multiple sentences, and an event role may appear in multiple events. Although DCFEE uses contextual information to predict event triggers, its context-agnostic argument completion strategy makes it unable to effectively solve the argument-scattering problem. Moreover, direct document-level supervision is more robust than the extra sentence-level supervision used in DCFEE, which assumes that the sentence with the most event arguments contains the key event. This assumption does not work well on some event types, such as EF, EU, and EO. When event arguments appear in different sentences, it is hard for the original Transformer to capture inner dependencies, so Doc2EDAG may fail to extract such events. In TIER, event extraction is divided into multi-stage tasks, but error propagation between the models is ignored. TDJEE uses joint learning to alleviate the influence of error propagation. Furthermore, even when BERT is replaced with Bi-LSTM, our method is still better than most baseline models. This is mainly because TDJEE can effectively capture contextual information and integrate it into document-level contextual information.

Ablation Performance
To evaluate the different parts of TDJEE, we perform an ablation experiment that compares three variants: (1) CRF: removing the CRF layer from the event argument extraction model. (2) Relation Encoding: removing relation encoding from the Transformer, that is, replacing the relation-aware Transformer with the original Transformer. (3) Context Embedding: removing the context information from the event type detection model, that is, removing the relation-aware Transformer layer. The experimental results are shown in Table 6. [...] 17.15%, and 20.64% on five event types, respectively. In the event argument extraction model, CRF can correct the outputs from BERT and reduce the occurrence of error propagation. Therefore, the performance of the TDJEE model can be improved by adding a CRF layer. In the event type detection model, explicitly encoding relations not only helps information flow between related sentences, but also increases the weight of such relations in the inner dependencies. This is important for downstream tasks. Furthermore, the influence of error propagation from the event argument extraction stage can be alleviated. In addition, contextual information is crucial for event type detection. Because an event role may appear in different events, ignoring the context information could lead to misclassification and reduce the performance of the model.

Evaluation of Knowledge Base and Hyperparameters
We compare the performance of the event extraction model with that of the corresponding event knowledge base. As shown in Figure 8, the F1 score of the knowledge base on ER is the highest, and the corresponding event extraction model also has the best performance. Although the EP dataset is 4.2 times the size of the ER dataset, the performance of both knowledge base construction and the event extraction model on EP is lower than on ER. Therefore, simply increasing the size of the corpus cannot effectively improve the performance of the model; it is also necessary to improve the quality of the knowledge base at the same time. EF and EP have very close knowledge base performance, and the difference between them in event extraction performance is less than 1%. EO and EF also have close knowledge base construction performance, but their event extraction performance differs by 3%. Therefore, even when different event types are close to each other in knowledge base construction performance, there may still be a noticeable gap in the performance of the event extraction model.
To study the influence of hyperparameters in the TDJEE model, we compare the P, R, and F1 scores under different values of the hyperparameter λ and different batch sizes. As shown in Figure 9a, the model performs best when λ is 0.2, and its performance shows a downward trend when λ is 0.3 and 0.4. Because the input of the event extraction model is a whole document, the number of event arguments in the document is larger than the number of event sentences. Therefore, the loss accumulated over event arguments is greater than that generated by event type detection, and a smaller value of λ can effectively balance the two losses, thereby maximizing the performance.
Figure 9b shows the P, R, and F1 scores of the TDJEE model with different batch sizes. The training data set in this paper is processed on GTX 2080 GPU. As the batch size increases, the model converges faster for the same number of epochs. When the batch size is 8, the model takes at least 30 epochs to converge. When the batch size is 32, the model tends to converge after 20 epochs. At the same time, the F1 score of the model also increases as the batch size increases.
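The balancing role of λ discussed above can be illustrated with a small sketch. It assumes that the joint loss is a convex combination in which λ weights the event argument extraction loss and 1 − λ weights the event type detection loss; this form is an assumption made for illustration, as are the function names and the loss values, and it is not necessarily the exact objective used in TDJEE.

```python
from typing import Tuple


def loss_contributions(arg_loss: float, type_loss: float,
                       lam: float) -> Tuple[float, float]:
    """Weighted contributions of the two sub-task losses (assumed form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * arg_loss, (1.0 - lam) * type_loss


def joint_loss(arg_loss: float, type_loss: float, lam: float) -> float:
    """Assumed joint objective: lam * L_arg + (1 - lam) * L_type."""
    weighted_arg, weighted_type = loss_contributions(arg_loss, type_loss, lam)
    return weighted_arg + weighted_type


# Hypothetical raw losses: the argument loss accumulates over many tokens,
# so it is larger than the event type detection loss.
arg_loss, type_loss = 8.0, 2.0
for lam in (0.2, 0.5):
    weighted_arg, weighted_type = loss_contributions(arg_loss, type_loss, lam)
    print(f"lam={lam}: arg={weighted_arg:.2f}, type={weighted_type:.2f}, "
          f"total={joint_loss(arg_loss, type_loss, lam):.2f}")
```

With these illustrative values, λ = 0.2 makes the two weighted terms roughly equal, whereas λ = 0.5 lets the argument loss dominate the total, which is consistent with the observation that a smaller λ balances the losses.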

Conclusions
In this paper, we propose a new document-level joint event extraction model, TDJEE, to address the challenge of scattered event arguments in the financial domain. TDJEE includes two sub-models: event argument extraction and event type detection. Event argument extraction is treated as a sequence labeling task, in which BERT and CRF are used for argument extraction and role labeling. Event type detection is treated as a binary classification task, with five discriminant models defined for the five corresponding event types. In addition, a relation-aware Transformer and an attention network are used to further capture the semantic information of the document-level context. We also build a Fonduer-based knowledge base and use a distant supervision method to label a large, high-quality corpus for training and evaluation.
Experimental results on a real-world Chinese financial announcement dataset show that our model performs better than the baseline models and achieves competitive results. It is worth mentioning that TDJEE is a language-independent model and can be used for financial event extraction in other languages.
Furthermore, our experimental results point to two limitations. First, the quality of the knowledge base can significantly affect the performance of event extraction: errors in the event knowledge base can cause wrong labels during distant supervision labeling, and these wrong labels in turn affect the performance of the TDJEE model. In particular, our current knowledge base is imperfect and contains some incorrect event knowledge. Therefore, in the future, manual verification or reinforcement learning will be introduced to improve the quality of the knowledge base. Second, although TDJEE, as a joint model, overcomes the error propagation problem of pipeline models, it still suffers from the argument extraction bottleneck, which is also worth exploring in future work.