Chinese Event Detection without Triggers Based on Dual Attention

Abstract: In natural language processing, event detection is a critical step in event extraction, aiming to detect the occurrences of events and categorize them. Currently, the defects of Chinese event detection based on triggers include polysemous triggers and trigger-word mismatches, which reduce the accuracy of event detection models. Therefore, event detection without triggers based on dual attention (EDWTDA), a trigger-free model that skips the trigger identification process and determines event types directly, is proposed to fix these problems. EDWTDA adopts a dual attention mechanism, integrating local and global attention. Local attention captures key semantic information in sentences and simulates hidden event trigger words to solve the problem of trigger-word mismatch, while global attention mines the context of documents, fixing the problem of polysemous triggers. In addition, event detection is transformed into a binary classification task to avoid problems caused by multiple tags, and the sample imbalance brought about by the transformation is addressed with the focal loss function. The experimental results on the ACE 2005 Chinese corpus show that, compared with the best baseline model, JMCEE, the accuracy rate, recall rate, and F1-score of the proposed model increased by 3.40%, 3.90%, and 3.67%, respectively.


Introduction
Event extraction is an important task in natural language processing, and event detection is a crucial subtask of event extraction that aims to detect the occurrences of predefined events and categorize them. As a core technology in the field of artificial intelligence, event detection is widely used in the construction of event graphs and the generation of text summaries. The high-quality, structured knowledge it produces can give intelligent models a deeper understanding of text, more accurate task queries, and, to a certain extent, logical reasoning abilities, thus playing a crucial role in the analysis of massive information. An event is composed of trigger words and event arguments. However, event trigger words suffer from polysemy and trigger-word mismatch. As a result, event detection methods with trigger identification at their core are prone to classification errors.
A trigger word with multiple senses can be classified into different event types in different contexts. As shown in Figure 1, "维和部队向抗议者释放了催泪弹" and "开庭审理之后应当当庭释放" are Chinese sentences. The first trigger word "释放" (release) represents an attack event (release tear gas), and the second trigger word "释放" (release) represents a release-parole event (release a man in court). The same trigger word thus expresses different semantic information in different contexts. Therefore, the polysemy of words limits the accuracy of event classification, and the true meaning of "释放" can only be determined with the help of contextual information.
Due to the lack of natural separators in Chinese, word segmentation is a necessary pre-processing step in event detection. However, Chinese event trigger words can be either part of a word or span multiple words. Examples are given in Figure 2, where "这家公司并购了多家公司" and "那个受了伤的士兵不治身亡" are Chinese sentences. The two characters in the word "并购" (mergers and acquisitions) can trigger two different events: "并" (merge) triggers a merge event, and "购" (purchase) triggers a transfer-of-ownership event. So, it is impossible to find the corresponding label in the corpus for the segmented word. Similarly, "受了伤" (injured) does not exactly match the trigger word "受伤" (injured). In such cases, event detection methods based on trigger identification cannot locate the trigger correctly, which reduces the accuracy rate and the recall rate of event classification.
Research on event detection began early. In the early stage, feature-based methods [1,2] were mainly adopted to detect the words that most effectively express the occurrences of events in sentences (that is, the event trigger words) and to classify the event types through manually designed features. However, these methods require a lot of manpower and time to design the corresponding features and lack generalization ability. With the further development of research, it has become a trend to use deep learning methods to automatically mine text features.
The dynamic multi-pooling convolutional neural network (DMCNN) employs dynamic multi-pooling layers to retain more critical information and classifies each word in a sentence to identify event trigger words [3]. Attention mechanisms can provide semantic information in sentences [4]. Dependency relationships can reflect the mutual relationship between candidate trigger words and related entities [5,6]. A graph structure can be constructed from dependency relationships, and a graph convolutional network (GCN) can then avoid attending to irrelevant words in the convolution process [7]. Incremental learning enables the model to detect new event types on new datasets [8]. Since Chinese is more complex than English, polysemous triggers [9] and trigger-word mismatch [10] remain problems in event trigger identification. Nevertheless, trigger identification is only a dispensable intermediate step of event detection. The attention mechanism can simulate event trigger words and detect events in the absence of trigger words [11]. In addition, event detection and extraction can be transformed into a question-answering task, extracting the event arguments directly in an end-to-end manner [12].
In summary, to deal with the problems of polysemous triggers and trigger-word mismatch, this paper puts forward EDWTDA, a model that adopts local attention to capture keywords and sentence-level semantic information, simulating hidden event trigger words, and global attention to mine the context of documents, determining the true meaning of words. This makes it possible to skip trigger identification and determine the event type directly. For example, S1 says, "The car accident was so severe that he did not wake up despite the doctors' best efforts", and S2 says, "He left his family with remorse from then on". Without considering the full context, S2 may express either a death event or a position change event. However, by also considering the document information, the event type can be accurately classified as a death event. Therefore, the context of the full text can assist in determining the event type, which is of great significance for event detection and classification.
The main contributions of this paper are as follows: (1) We propose EDWTDA, an event detection model without triggers based on dual attention, to address the problems of polysemous triggers and trigger-word mismatch in Chinese event detection methods based on trigger identification; (2) We transform the event detection task into a binary classification task and use the focal loss function to address the sample imbalance and vanishing gradient brought about by the transformation; (3) We conduct extensive experiments on the ACE 2005 Chinese dataset and the dam safety operation log dataset. The experimental results show that our method outperforms existing classical event detection methods.
The remainder of this paper is structured as follows: Section 2 reviews the literature related to event detection. Section 3 introduces the event detection method without triggers. Section 4 compares our model's results to those of other traditional methods. Section 5 concludes with some recommendations.

Event Detection Based on Trigger Identification
Deep learning has recently become popular for automatically mining text features. Nguyen and Grishman used a convolutional neural network (CNN) to represent the window around candidate trigger words [13] and used global max pooling to summarize the extracted features before passing them to a linear classifier for event classification. Subsequently, discontinuous convolution was introduced to skip insignificant words in word sequences to improve the accuracy and recall of event detection [14]. Although CNNs could effectively capture syntactic and semantic information between words in a sentence, they could not encode the meaning of words in different contexts. Hybrid neural networks combined CNN and BiLSTM to capture sequence and block information in a specific context for prediction [15], but they were still incapable of disambiguating certain events [16]. The gated multi-lingual attention (GMLATT) system introduced attention mechanisms to mine coherent information in multilingual data [4], alleviating data scarcity and monolingual ambiguity.
On the basis of deep learning, adding features related to event trigger words can improve the accuracy rate and recall rate of recognition. For example, combining the sequential features of entity types and word sequence features can help the model find the in-sentence triggers [17]. For event detection, the syntactic relationship representation based on dependency trees reflects the interrelationship between candidate trigger words and related entities better than the sentence representation. Dependency tree-based graph convolutional networks and attention mechanisms allowed explicit modeling and aggregation of multi-order syntactic representations in sentences [5]. However, this ignored the dependency label information. Viet used gate control to make up for the insufficient consideration of candidate keywords in GCN and calculated the correlation with trigger words to indicate the importance of words [18]. Dutta integrated dependency relationships between words in syntactic parsing trees and dependency labels into graph transformer networks (GTN) to improve the precision and complexity of the model [19]. The scarcity of annotated data is a great challenge for event detection [20]. By averaging the performance of computing event prototypes and consuming only one of the events mentioned, the dynamic-memory-based prototyping network (DMB-PN) can produce more robust sentence encoding and solve the few-shot problem [21]. Viet et al. made full use of the matching information between supporting samples in few-shot learning by calculating the intra-class and inter-class distance based on metric learning [22]. Known types of annotations can be used for semi-supervised learning to reduce manual labeling costs [23].
Likewise, external open-domain trigger knowledge can alleviate data sparsity and reduce the built-in bias of high-frequency trigger words in annotations [24], but it relies on complex, predefined rules and existing instances in the knowledge base and has problems such as a low coverage rate, subject bias, and data noise. Targeting the above problems, Wang et al. constructed a large event-related candidate set with high coverage and then applied an adversarial training mechanism to iteratively identify informative events from the candidate set while filtering out noisy events [25].
The traditional ACE event detection methods treat multiple events in a sentence as independent events and identify them separately using sentence-level information. However, events in a sentence are typically interdependent, and sentence-level information is frequently insufficient to reduce the ambiguity of certain events. Hierarchical and bias tagging networks with gated multi-level attention mechanisms (HBTNGMA) can simultaneously solve these two problems, realize automatic extraction and dynamic fusion of sentence-level and document-level information, and detect multiple events in a sentence [26]. It is evident that document-level information is vital for event detection. The document embedding enhanced bi-directional recurrent neural network (DEEB-RNN) uses document embedding to enhance bi-directional recurrent neural networks, helping identify event trigger words and their types [27].
In Chinese sentences, event trigger words may appear within or between words after word segmentation. This problem can be solved by converting them into character sequence markers [28]. The above problems can be classified as trigger-word mismatch, while the polysemy of Chinese trigger words also affects event classification; both can be remedied by a trigger-aware lattice neural network (TLNN) [10]. To avoid the mismatch of word triggers, TLNN dynamically combines words and characters. Meanwhile, it mitigates the ambiguity of polysemous words by modeling them with an external language knowledge base. In fact, TLNN can only resolve the intersection of multiple words but cannot discriminate between unannotated corpora. The scarcity of annotated data remains a significant challenge for event detection. At the moment, external open-domain trigger knowledge [24], few-shot learning [21], and adversarial generative networks [25] primarily alleviate the effect of the few-shot problem, but low coverage, topic bias, and data noise remain problems.

Event Detection without Trigger Words
Event trigger word recognition is an optional intermediate step in event detection. The type-aware bias neural network with attention mechanisms (TBNNAM) attempts to detect events without trigger words by simulating hidden event trigger words and has demonstrated its effectiveness [11]. However, the skip-gram embedding used in TBNNAM cannot consider contextual information. Sentence-level information is usually insufficient to guarantee disambiguation in certain types of events and cannot avoid the ambiguity of trigger words (i.e., the polysemy problem). In addition, the biased MSE loss function used in TBNNAM is prone to the vanishing gradient problem in the classification task, which may make network training difficult. Documents can provide context to assist in determining the true meaning of words [26]. Furthermore, the end-to-end model Doc2EDAG implements document-level event extraction without trigger words [29]. Du et al. formulate event detection as a question-answering task to extract event arguments in an end-to-end manner, avoiding the error propagation caused by relying on entity recognition [12]. This approach, however, is based on complex, predefined rules and is less effective in the case of a sparse corpus.

Event Detection without Triggers Based on Dual Attention
Figure 3 shows the event detection architecture of EDWTDA, covering the ALBERT embedding layer, feature construction layer, BiLSTM layer, attention layer (local attention and global attention), fusion gate layer, and sigmoid layer. "玉溪市通海县发生五级地震" represents the input Chinese sentence. "地震(事件类型)" represents the event type.
The ALBERT embedding layer transforms sentences and documents into embedding vectors, which can selectively use information from all levels and solve the polysemy problem of traditional word embedding methods. The word embedding vector, named entity recognition type, and lexical annotation are concatenated to form the feature construction layer. This helps increase the attention weight score of keywords and filter out irrelevant words. The BiLSTM layer can effectively capture the semantic information of each word. The attention layer includes local attention and global attention. The BiLSTM layer is combined with local attention to mine keywords, simulate hidden event trigger words, and avoid trigger-word mismatch. To avoid the ambiguity of trigger words, global attention contains not only sentence-level semantics but also document-level context. The fusion gate layer computes the proportion of local and global attention vector weights and determines the event type of the sentence using the sigmoid function. The sigmoid layer addresses the problem of sample imbalance with the focal loss function.
Furthermore, the input layer consists of a complete sentence with event information (Sentence), the complete document containing that sentence (Document), and the event type of the sentence (Table Map). "Sentence" is input to the encoding layer for encoding, and the sentence-level semantic information is captured from its output in the local attention layer. "Document" is input to the global attention layer to capture document-level semantic information. "Table Map" is input into the dual attention layer to assist the attention network in event detection without trigger words.
The ALBERT embedding layer and feature construction layer will be described in Section 3.1, the BiLSTM layer, attention layer, and fusion gate layer in Section 3.2, and the remaining parts in Section 3.3.

Embedding Vector
In the feature construction layer, the feature vector W is composed of three parts: the word embedding vector, the entity type embedding vector, and the lexical annotation embedding vector, as shown in Figure 3. Word embedding converts text into mathematical representations so that similar words have similar vector representations. Word embeddings can capture meaningful semantic patterns in words [30,31] and assist neural networks in processing text. In this paper, we use ALBERT [32] as the word embedding model to transform the Chinese sentences containing event information and the documents in which the sentences are located into feature vectors. Unlike word2vec [33], ALBERT generates embedding vectors by first pre-training on a large corpus and then fine-tuning on specific small datasets. ALBERT dynamically learns context so that word embedding vectors have richer semantics, which solves the polysemy problem of word2vec. Compared with BERT [34], ALBERT has fewer training parameters, which means it is faster and could work better after expanding the model's depth. It is one of the best word-embedding models.
The purpose of event detection is to classify events through the identification of keywords (i.e., event trigger words). Both Chinese and English words exhibit polysemy. The embedding vectors generated by ALBERT are rich in meaning, which can alleviate the polysemy problem and play an important role in event detection.
Entity types can be converted to low-dimensional vectors by looking up a randomly initialized embedding table. Event trigger words are mostly verbs with a few nouns, while entities are the opposite. Therefore, embedding vectors of entity types can improve the probability of trigger word recognition. Lexical annotation can be performed based on word meaning and context. In this paper, we use Stanford CoreNLP to label the lexical properties (part of speech) of each word and use an embedding table to convert them into low-dimensional vectors. Since event trigger words are mostly verbs or nouns, the lexical annotation embedding vector can help the model filter out unnecessary words.
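The construction of the feature vector W can be sketched as a simple concatenation. This is a minimal illustration under stated assumptions: the word vector stands in for an ALBERT output, the embedding dimensions, the tag names (`GPE`, `PER`, `O`, `NN`, `VV`, `NR`), and the helper names `make_table` and `build_feature` are hypothetical and do not come from the original implementation.

```python
import random

random.seed(0)  # fixed seed so the randomly initialized tables are reproducible


def make_table(keys, dim):
    """Randomly initialized embedding table: one small vector per symbol."""
    return {k: [random.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


# Hypothetical tag inventories and dimensions (assumptions for illustration).
entity_table = make_table(["GPE", "PER", "O"], dim=4)
pos_table = make_table(["NN", "VV", "NR"], dim=4)


def build_feature(word_vec, entity_type, pos_tag):
    """Concatenate word, entity type, and lexical annotation embeddings."""
    return list(word_vec) + entity_table[entity_type] + pos_table[pos_tag]


# Stand-in for an 8-dimensional contextual word embedding.
word_vec = [0.0] * 8
w = build_feature(word_vec, "GPE", "NR")
print(len(w))  # 8 + 4 + 4 dimensions
```

In a trained model the two lookup tables would be learned together with the rest of the network rather than left at their random initial values.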
The events in sentences are pre-trained by incorporating dual attention layers. The event type (Table Map) is transformed into two embedding vectors, $t_1$ and $t_2$. $t_1$ assists local attention in capturing key information in a sentence and is used to simulate hidden event trigger words. $t_2$ supports global attention by capturing context in documents and avoiding the ambiguity of trigger words.

Trigger Simulation and Contextual Analysis
Each type of event is triggered by a specific set of words called event trigger words. For example, seismic events are usually triggered by words like "earthquake", "shaking" and "trembling". Therefore, the event trigger words are important clues for this task. To avoid the problems caused by event trigger words, we add local attention to simulate event trigger words so as to detect events without trigger words. BiLSTM captures the meaning of each word in the sentence. In addition, BiLSTM is combined with local attention to find the key information in the semantics and assign each word a weight according to its degree of importance. The word with the highest weight is treated as the hidden event trigger word to prevent mismatch.
BiLSTM is suitable for modeling sequence data. It can capture word context using forward and backward LSTM units, thus effectively capturing the semantics of each word in a sentence. We input the feature vector W into BiLSTM and obtain the forward and backward hidden states. To make full use of the semantics in the hidden states, they are combined into a single output vector $h$. The local attention vector $\alpha_s$ is generated from the output vector $h$ and the event type embedding vector $t_1$, as shown in Equation (1),
where $h_k$ is the $k$th part of the output vector $h$, $\alpha_s^k$ is the $k$th part of the local attention vector $\alpha_s$, and $t_1^T$ is the transpose of the event type embedding vector. $\alpha_s$ can mine the keywords in the sentence, so that the attention weight of the word that triggers the target event type is higher than that of other words, achieving the goal of simulating trigger words.
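The weighting described above can be illustrated with a minimal sketch. It assumes that the local attention weights are a softmax over dot-product scores between each hidden state and the event type embedding, which is a common form for this kind of attention; the function name `local_attention` and the toy values are illustrative and are not taken from the original implementation.

```python
import math


def local_attention(hidden_states, t1):
    """Attention weights over words.

    Assumption: the score of word k is the dot product h_k . t_1^T,
    and the scores are normalized with a softmax.
    """
    scores = [sum(h_i * t_i for h_i, t_i in zip(h_k, t1)) for h_k in hidden_states]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three words with hidden size 2 (illustrative values only).
h = [[0.1, 0.2], [2.0, 1.5], [0.3, -0.4]]
t1 = [1.0, 1.0]

alpha_s = local_attention(h, t1)

# The word with the highest weight plays the role of the hidden trigger word.
trigger_index = max(range(len(alpha_s)), key=alpha_s.__getitem__)
print(trigger_index)
```

Because the weights are computed per word (or per character) rather than per annotated trigger span, no exact match against a trigger lexicon is needed.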
Although local attention can solve the problem of trigger-word mismatch, it still cannot solve the polysemy problem. Global attention is introduced to solve the problem of ambiguous triggers. By learning the keywords in the sentence and the contextual information in the document, the unique meaning of the trigger word in the given scene can be obtained to help judge the event type.
The global attention vector $\alpha_d$ is built from three parts: the output vector $h$, the event type embedding vector $t_2$, and the document-level embedding vector $d$ obtained by Doc2Vec, as shown in Equation (2),
where $h_k$ is the $k$th part of the output vector $h$, $\alpha_d^k$ is the $k$th part of the global attention vector $\alpha_d$, $t_2^T$ is the transpose of the event type embedding vector, and $d^T$ is the transpose of the document-level embedding vector.
Although global attention can solve the polysemy problem, it dilutes the weight of keywords and impairs the model's judgment ability when there is too much noisy data in the document. Therefore, neither local nor global attention can solve all problems alone. We weight local and global attention to improve the event detection accuracy, as shown in Equation (3).
Here, the final output value $o$ consists of two components, $v_s$ and $v_d$. $v_s$ is generated by the dot product of $\alpha_s$ and $t_1$ to capture local features and simulate hidden event trigger words; $v_d$ is generated by the dot product of $\alpha_d$ and $t_2$ to capture global features and context. $\sigma$ is the sigmoid function, and $\lambda \in [0, 1]$ is a hyper-parameter that balances $v_s$ and $v_d$.
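The fusion step can be sketched as follows. This is a hedged illustration rather than the exact formula: it assumes that the local and global components are scalars and that they are mixed as a convex combination weighted by the hyper-parameter before the sigmoid is applied. The function names `sigmoid` and `fuse` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(v_s, v_d, lam):
    """Fuse local (v_s) and global (v_d) scores.

    Assumption: o = sigmoid(lam * v_s + (1 - lam) * v_d), with lam in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return sigmoid(lam * v_s + (1.0 - lam) * v_d)


# lam = 1 relies on local attention only; lam = 0 relies on global attention only.
o_local_only = fuse(3.0, -3.0, 1.0)
o_global_only = fuse(3.0, -3.0, 0.0)
o_balanced = fuse(3.0, -3.0, 0.5)
print(o_local_only, o_global_only, o_balanced)
```

Under this assumption, a noisy document (which weakens the global score) can be compensated for by choosing a larger weight on the local component, and vice versa for ambiguous sentences.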

Model Training

Binary Classification and Identification
The multi-label problem arises from the fact that each sentence can contain an arbitrary number of events, implying that it can have zero or more target labels. Reconstructing the training samples transforms event detection from a multi-class classification task into a binary classification task, solving the multi-label problem. Each training sample is a <sentence, event type> pair with a label of 1 or 0, indicating whether the given sentence conveys an event of type t. Taking the public dataset ACE 2005 as an example, there are 33 target event types. Assuming that a sentence contains only one event, it yields 32 negative pairs and one positive pair. The reconstruction of training samples therefore creates a new problem: since most sentences express at most two events, negative samples far outnumber positive samples. This phenomenon, known as sample imbalance, affects the accuracy and recall rate of classification.
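The reconstruction of training samples can be sketched in a few lines. The function name `build_pairs` and the placeholder event type names are hypothetical; only the counts (33 target types, one positive and 32 negative pairs for a single-event sentence) follow the description above.

```python
def build_pairs(sentence, gold_types, all_types):
    """Turn one multi-label sentence into binary <sentence, event type> samples.

    Each sample is (sentence, event_type, label), where label is 1 if the
    sentence conveys an event of that type and 0 otherwise.
    """
    return [(sentence, t, 1 if t in gold_types else 0) for t in all_types]


# Placeholder names standing in for the 33 target event types.
event_types = [f"type_{i}" for i in range(33)]

# A sentence that expresses exactly one event.
pairs = build_pairs("玉溪市通海县发生五级地震", {"type_7"}, event_types)

positives = sum(label for _, _, label in pairs)
negatives = len(pairs) - positives
print(positives, negatives)
```

A sentence with no events produces 33 negative pairs, which makes the imbalance between negative and positive samples even more pronounced.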
In addition to the imbalance of positive and negative samples, the dataset itself has an imbalance problem between easy-to-classify and difficult-to-classify samples. Easy-to-classify samples are mostly classified correctly and have a low loss, whereas difficult-to-classify samples are classified correctly less often and have a higher loss.
A large number of easy-to-classify samples will bias the model's overall learning direction. The model can easily identify which events are not contained in the sentence, but it cannot identify the specific types of events contained in the sentence.

Focal Loss Function
The traditional loss function cannot solve the above problems, and the method of increasing the weight of positive samples is adopted to solve the imbalance problem between positive and negative samples.However, it is still unable to process difficult-to-classify samples.For example, when the average loss function is classified incorrectly, it does not update the weight and tends to average.When the classification of the cross-entropy loss function is correct, the weight is not updated and tends to 1.As a result, a new loss function needs to be introduced in order to solve the above two sample imbalance problems concurrently.EDWTDA uses focal loss [35] as the loss function to increase the influence of positive and difficult-to-classify samples on the model.The equation for all training examples is shown in Equation (4).
Here, x consists of a sentence and a target event type, y ∈ {0, 1} is the label, o(x^(i)) is the model prediction for the i-th sample, ||θ||^2 is the sum of squares of the model parameters, δ > 0 is the weight of the L2 regularization term, β balances the weights of positive and negative samples, and γ balances the weights of hard-to-classify and easy-to-classify samples.
Focal loss works best when β = 0.25 and γ = 2. The model then concentrates on positive and hard-to-classify samples, which improves prediction. In addition, we use L2 regularization to prevent overfitting.
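To make the roles of β and γ concrete, the following is a minimal pure-Python sketch of a binary focal loss with an L2 penalty. It assumes the standard class-balanced focal loss form summed over samples; the function names `focal_loss` and `l2_penalty`, the clipping constant `eps`, and the example probabilities are illustrative assumptions, not the authors' implementation.

```python
import math


def focal_loss(y_true, y_prob, beta=0.25, gamma=2.0, eps=1e-7):
    """Summed binary focal loss (assumed standard form).

    beta weights positive samples and (1 - beta) weights negative ones;
    gamma down-weights samples the model already classifies confidently.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        if y == 1:
            total += -beta * (1.0 - p) ** gamma * math.log(p)
        else:
            total += -(1.0 - beta) * p ** gamma * math.log(1.0 - p)
    return total


def l2_penalty(params, delta):
    """delta times the sum of squared parameters (the L2 term)."""
    return delta * sum(w * w for w in params)


# An easy positive (confidently correct) versus a hard positive.
easy = focal_loss([1], [0.9])
hard = focal_loss([1], [0.1])
print(easy < hard)
```

With γ = 2, the confidently correct sample is scaled by (1 − 0.9)^2 = 0.01 while the hard sample is scaled by (1 − 0.1)^2 = 0.81, so hard samples dominate the total loss.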

Experiment Preparation

Dataset
We perform extensive experimental studies on the ACE 2005 Chinese dataset and the dam safety operation log dataset. The ACE 2005 Chinese dataset contains 633 articles with 8 event types and 33 event subtypes [36]. As shown in Table 1, the dam safety operation log dataset contains 1000 reports, consisting of two parts: special inspection reports and daily inspection reports over the years. As Table 2 shows, it covers 7 event types (earthquake, heavy rain, flooding, pre-flood safety inspection, comprehensive special inspection, routine maintenance, and daily inspection) and 17 argument roles. Table 3 shows a comparison of the two datasets.

Case: On 13 August 2018, an M5.0 earthquake occurred in Tonghai County, Yuxi City, Yunnan Province, with a focal depth of 7 km. The earthquake epicenter was about 231 km in a straight line from the Manwan power plant dam, and the Manwan production area experienced slight tremors. In order to grasp the impact of the earthquake on the Manwan power plant's hydraulic buildings, the power plant timely carried out a comprehensive special inspection.

In all experiments, 80% of the data was used as the training set, 10% as the validation set, and 10% as the test set.
The distribution of event types in the ACE 2005 Chinese dataset has a long-tail problem: a small number of event types account for a large number of training samples, while the other event types have few training samples, as shown in Figure 4.
During the normal operation of the dam, there are many special events, such as flood discharge, earthquakes, and rainstorms. According to the severity of the events, dam operation and maintenance personnel will take response measures, such as arranging routine maintenance and special inspections, to ensure the safe operation of the dam. Therefore, effectively detecting and classifying events in the dam safety operation log has become an urgent demand of dam operation and maintenance personnel.

Baselines
We compare our model with various baselines, as follows: (1) DMCNN [3]: uses dynamic multi-pooling layers to retain more important information based on event trigger words and arguments; (2) HNN [15]: combines CNN and BiLSTM to capture sequence and block information in specific contexts; (3) HBTNGMA [26]: detects multiple events in a sentence using hierarchical and biased labeling networks, allowing automatic extraction and dynamic fusion of sentence-level and document-level information; (4) NPN [9]: captures structural and semantic information from characters and words by learning a hybrid representation of each character to solve the word-trigger mismatch problem; (5) TLNN [10]: dynamically integrates word and character information to avoid word-trigger mismatch; (6) JMCEE [37]: jointly predicts event trigger words and event arguments based on shared feature representations of pre-trained language models to solve the common role overlap problem in practice; (7) TBNNAM [11]: simulates hidden event trigger words to detect events without trigger words.
DMCNN uses skip-gram for unsupervised pre-training of word embeddings on the NYT corpus. NPN follows DMCNN's pre-trained model. HNN and TLNN use the skip-gram pre-trained model. JMCEE uses a pre-trained language model for joint event extraction. HBTNGMA and TBNNAM have no pre-trained models.
In this paper, EDWTDA uses ALBERT as its pre-trained model. Section 4.2.3 contains the ablation comparison experiments in which ALBERT is replaced with skip-gram.

Experimental Setup
The experiments in this paper are implemented in the same software and hardware environment, and the specific environment configuration is shown in Table 4. Hyper-parameters are tuned on the validation dataset by grid search, and their values are listed in Table 5. ALBERT produces 312-dimensional word embedding vectors, and the training lookup table generates 200-dimensional entity type embedding vectors and 200-dimensional lexical annotation embedding vectors. The BiLSTM hidden layer size is set to 256, and both the local and global attention network hidden layer sizes are set to 128. A dropout layer with a drop ratio of 0.5 is applied at the second-to-last layer to avoid overfitting. The training batch size is 16, the number of iterations is 100, and the model is optimized with the Adam optimizer at a learning rate of 0.002. λ, which adjusts the weighted fusion ratio of the local and global attention networks, is set to 0.44.

... which can determine the event type directly, so only the latter half of the experimental results are available. In Table 6, EDWTDA outperforms other baselines on both datasets, achieving the best precision, recall, and F1-score. On the ACE 2005 dataset, EDWTDA improves precision, recall, and F1-score by 3.40%, 3.90%, and 3.67%, respectively, compared with the best baseline, JMCEE. On the dam safety operation log dataset, EDWTDA improves precision, recall, and F1-score by 4.18%, 4.58%, and 4.39%, respectively, compared with JMCEE. Overall performance on the dam safety operation log dataset is better than on the ACE 2005 dataset because the dam event types are limited and the sentence structures follow relatively fixed formats. Compared to public datasets, the meanings are relatively simple and unambiguous, making it easier to determine the event type. The ACE 2005 Chinese dataset has an uneven data distribution with an obvious long-tail problem, as shown in Figure 4. The results demonstrate that EDWTDA performs well on data-poor event types.
The experimental results also show that DMCNN, as a classical event detection model, is not only outperformed by HBTNGMA in terms of precision, recall, and F1-score on both datasets but also lags significantly behind the other baselines. HNN has the highest precision on the ACE 2005 dataset. HNN combines CNN and BiLSTM, which can better capture context and improve the model's precision. However, it has the lowest recall. HBTNGMA is highlighted in gray in the experimental results because it had no pre-trained model during the experiments. HBTNGMA uses document-level information to detect multiple events in a sentence, and its precision and recall are both improved over DMCNN. However, it does not solve the mismatch problem caused by word segmentation in Chinese text, and there is little room for improvement. Adding a pre-trained model such as BERT could improve its results, which also shows the power of pre-trained models in NLP. By fusing character-level and lexical-level information, NPN obtains a hybrid representation that captures internal character composition and accurately classifies event information. NPN attempts to deal with the word-trigger mismatch, but with limited improvement in precision and recall. TLNN focuses on solving the trigger-word mismatch problem by creating a path to link the cell states of all words between the start and end positions of the word. Compared with other baselines, it significantly improves the model's recall. However, TLNN ignores the significance of document context, resulting in limited progress in precision and overall F1-score. JMCEE focuses on solving the problem of role overlap and uses shared feature representations of pre-trained language models to jointly predict event trigger words and event arguments. It achieves breakthroughs in precision, recall, and F1-score, but it does not solve the problem of trigger-word mismatch. This results in a dramatic performance degradation in the process from trigger identification to classification. TBNNAM reduces the expensive manual labeling cost and explores event detection without triggers for the first time, and it is close to the best baseline model in terms of event detection precision. However, skip-gram embeddings do not consider context, and the gradient of the MSE loss function vanishes as the output probability approaches 0 or 1. Limited by these two problems, the recall of TBNNAM is low, leading to the model's poor F1-score.
ALBERT reduces the number of parameters of the BERT pre-trained model and extracts text features from large-scale unlabeled text data to construct contextual word embedding vectors. In addition, ALBERT is combined with BiLSTM to capture important context and improve the model's predictive power. Consequently, EDWTDA has the best overall experimental results, with significant improvements in precision and recall.
To avoid word-trigger mismatch and improve the model's recall, EDWTDA detects events without trigger words, uses a local attention mechanism to simulate hidden event trigger words, and applies a global attention mechanism to capture essential document context.
To improve the model's generalization ability, EDWTDA converts the event detection task into a binary classification task and makes up for the imbalance between positive and negative samples by means of the focal loss function.

Dual Attention Fusion Ratio Analysis
EDWTDA's core components are local and global attention. Local attention captures key semantic information in sentences and simulates hidden event trigger words by giving higher attention scores to keywords. However, it does not avoid the ambiguity of trigger words. Global attention resolves trigger-word ambiguity and aids in determining event types by mining the rich semantics in documents, but it dilutes the share of attention given to key words and is vulnerable to noisy data in documents. Neither local nor global attention alone can solve all the problems, so it is critical to set an appropriate dual attention fusion ratio. In this paper, we observe how the F1-score of EDWTDA changes with the value of λ and take the value that yields the highest F1-score. Figure 5 shows that the curve first rises and then falls, peaking at 0.44 (i.e., the final value of λ), with a difference of about 1.5% between the two boundary points. Gradually increasing the proportion of local attention therefore helps the model focus on the key words in the sentence and improves classification accuracy. When the ratio reaches 0.44, the benefit of local attention reaches its peak; beyond that point, further increases only weaken the contribution of global attention.
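The fusion and the selection of λ can be sketched as below. This is a hedged illustration: the fusion equation is not given in this section, so the sketch assumes a convex combination of a local and a global attention score followed by a sigmoid, with λ weighting the local part. The names `fuse`, `select_lambda`, and `toy_score` are hypothetical, and `toy_score` is a synthetic stand-in for "train and evaluate the model at this λ", not data from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(local_score, global_score, lam):
    """Assumed fusion: lam weights the local attention score and
    (1 - lam) weights the global attention score."""
    return sigmoid(lam * local_score + (1.0 - lam) * global_score)


def select_lambda(candidates, score_of):
    """Return the candidate lambda with the highest validation score."""
    return max(candidates, key=score_of)


def toy_score(lam):
    """Synthetic rise-then-fall curve, used only to exercise the sweep."""
    return 1.0 - (lam - 0.44) ** 2


candidates = [round(0.01 * k, 2) for k in range(101)]  # 0.00, 0.01, ..., 1.00
best_lam = select_lambda(candidates, toy_score)
print(best_lam)
```

In practice `score_of` would wrap a full training and validation run per candidate, so a coarser grid than the one shown here may be preferable.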

Ablation Analysis
EDWTDA is made up of four major components: the word embedding layer, the feature construction layer, the dual attention mechanism layer, and the focal loss function.To figure out how much each of these four components influences the model, one of them is removed or replaced at a time, and the performance is compared with the original model.
Table 7 shows that the dual attention mechanism layer has the greatest impact. Bold values indicate the best experimental results. Removing this layer decreases the F1-score by 6.15% and 6.25%, respectively, on the two datasets, which indicates that this layer is the core component of event detection without trigger words. The dual attention mechanism layer raises the attention weight of the event trigger word within the sentence as well as that of the sentence within the document. It allows the model to pay more attention to the location of the event trigger word without special markers and improves event detection precision and recall. The F1-score decreases by 2.99% and 3.19%, respectively, on the two datasets after changing the word embedding layer from ALBERT to skip-gram, so the word embedding layer is second only to the dual attention mechanism layer in importance. This demonstrates that ALBERT can dynamically learn context and overcome the limitation that skip-gram cannot represent polysemous words. Meanwhile, the word embedding vectors produced by ALBERT carry richer meanings, contributing to the determination of event type. With focal loss replaced by cross-entropy, the F1-score decreases by 2.33% and 2.39% on the two datasets, respectively, showing that focal loss can reduce the sample imbalance and improve the model's generalization ability. The feature construction layer has the least impact. After removing it, the F1-score only decreases by 0.48% and 0.43%. This means that lexical annotation and entity labeling have only a subtle influence on model performance but do help identify trigger words. Trigger words are mainly verbs or nouns; entity labels are mostly O. The feature construction layer can filter out part of the words in the sentence and improve the model's prediction ability.

Each sentence may contain any number of events, which means it can have zero or multiple target labels. EDWTDA transforms event detection from a multi-class classification task into a binary one in order to solve the multi-label problem and uses the focal loss function to deal with the sample imbalance caused by the transformation.
To show that binary classification is more effective than multi-class classification, we convert EDWTDA into a multi-class form (EDWTDA-MC) and compare the two models on precision (P), recall (R), and F1-score (F1), as shown in Figure 6. EDWTDA significantly outperforms EDWTDA-MC on all three metrics, especially recall, possibly because EDWTDA-MC predicts at most one event for each sentence. In conclusion, converting event detection into binary classification is effective: it solves the multi-label problem and improves classification accuracy.
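The suggested reason for the recall gap can be illustrated with a small sketch that contrasts the two decoding schemes on a sentence with two gold events. The function names, the 0.5 decision threshold, the `NONE` class, and all scores are assumptions made for illustration; they are not taken from EDWTDA or EDWTDA-MC.

```python
def decode_multiclass(scores):
    """Single-label decoding: keep only the top-scoring class, so at most
    one event type is predicted per sentence ('NONE' means no event)."""
    best = max(scores, key=scores.get)
    return set() if best == "NONE" else {best}


def decode_binary(probs, threshold=0.5):
    """Per-type binary decoding: every <sentence, event type> pair is judged
    independently, so several event types can be predicted at once."""
    return {event_type for event_type, p in probs.items() if p >= threshold}


def recall(predicted, gold):
    return len(predicted & gold) / len(gold) if gold else 1.0


# Illustrative sentence with two gold events and made-up model outputs.
gold = {"Attack", "Die"}
mc_scores = {"NONE": 0.10, "Attack": 0.50, "Die": 0.40}
bin_probs = {"Attack": 0.90, "Die": 0.80, "Marry": 0.05}

mc_pred = decode_multiclass(mc_scores)
bin_pred = decode_binary(bin_probs)
print(recall(mc_pred, gold), recall(bin_pred, gold))
```

Under single-label decoding the second event can never be recovered, which caps recall on multi-event sentences regardless of how well the scores are calibrated.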

Experiment Summary
Extensive experiments on the ACE 2005 dataset and the dam safety operation log dataset show that EDWTDA performs better than all compared baselines. On the ACE 2005 dataset, EDWTDA improves precision, recall, and F1-score by 3.40%, 3.90%, and 3.67%, respectively, over the best baseline, JMCEE. On the dam safety operation log dataset, EDWTDA improves precision, recall, and F1-score by 4.18%, 4.58%, and 4.39%, respectively, over JMCEE. The comparison experiments verify that EDWTDA can solve the polysemy problem and the word-trigger mismatch problem, as well as improve event detection precision and recall.
Ablation analysis confirms that the word embedding vectors generated by ALBERT are rich in semantics. ALBERT can solve the problem of polysemy and improve the model's prediction ability. Lexical annotations and entity labels are shown to act as filters that improve the model's prediction ability. The dual attention mechanism can capture key semantics in sentences and documents, which can then simulate hidden event trigger words and solve the word-trigger mismatch problem. Meanwhile, the polysemy problem can be solved by using document context to improve the model's recall. The focal loss function is verified to mitigate the sample imbalance and improve the model's generalization ability. The classification number analysis confirms that the binary classification approach can solve the multi-label problem and improve classification accuracy.

Conclusions
In this paper, we propose EDWTDA to deal with polysemy and trigger-word mismatch. EDWTDA uses the dual attention mechanism to capture key semantics in sentences and documents, fusing event types to simulate hidden event trigger words. It performs event detection without trigger words and avoids the word-trigger mismatch. At the same time, EDWTDA solves the polysemy problem with the context contained in the document. Finally, the effectiveness of each module of EDWTDA is verified through experimental analysis. EDWTDA significantly improves event detection performance, showing that event detection can work well without trigger words.
In future work, we will conduct experiments on more languages with and without explicit word separators. In addition, we will try developing a dynamic mechanism to selectively consider the semantic information rather than take all the senses of characters and words into account.

Figure 2. An example of trigger-word mismatch.

Figure 3. The architecture of event detection without triggers based on dual attention.

Figure 4. Distribution of the number of event types in the ACE 2005 Chinese dataset.

Figure 5. The F1-score of EDWTDA varies with the value of λ.

Table 1. Example of dam safety operation log.

Table 2. Event types and corresponding event arguments in the dam safety operation log dataset.

Table 3. Comparison of two datasets.