Addressing Long-Distance Dependencies in AMR Parsing with Hierarchical Clause Annotation

Abstract: Most natural language processing (NLP) tasks operationalize an input sentence as a sequence with token-level embeddings and features, despite its clausal structure. Taking abstract meaning representation (AMR) parsing as an example, recent parsers are empowered by transformers and pre-trained language models, but long-distance dependencies (LDDs) introduced by long sequences remain an open problem. We argue that LDDs are not caused by sequence length per se but are essentially related to the internal clause hierarchy. Typically, non-verb words in a clause cannot depend on words outside of it, and verbs from different but related clauses have much longer dependencies than those in the same clause. With this intuition, we introduce a type of clausal feature, hierarchical clause annotation (HCA), into AMR parsing and propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate HCA trees of complex sentences for addressing LDDs. We conduct extensive experiments on two in-distribution (ID) AMR datasets (AMR 2.0 and AMR 3.0) and three out-of-distribution (OOD) ones (TLP, New3, and Bio). Experimental results show that our HCA-based approaches achieve significant and explainable improvements (0.7 Smatch score in both ID datasets; 2.3, 0.7, and 2.6 in the three OOD datasets, respectively) against the baseline model and outperform the state-of-the-art (SOTA) model (0.7 Smatch score in the OOD dataset, Bio) when encountering sentences with complex clausal structures that introduce most LDD cases.

Abstract meaning representation (AMR) parsing [12], or translating a sentence into a directed acyclic semantic graph with relations among abstract concepts, has made strides in counteracting LDDs in different approaches. In terms of transition-based strategies, Peng et al. [13] propose a cache system to predict arcs between distant words. In graph-based methods, Cai and Lam [14] present a graph ↔ sequence iterative inference to overcome inherent defects of the one-pass prediction process in parsing long sentences.
In seq2seq-based approaches, Bevilacqua et al. [15] employ the Transformer-based pretrained language model BART [16] to address LDDs in long sentences. Among these categories, seq2seq-based approaches have become mainstream, and recent parsers [17][18][19][20] employ the seq2seq architecture with the popular codebase SPRING [15], achieving better performance. Notably, HGAN [20] integrates token-level features, syntactic dependencies (SDP), and semantic role labeling (SRL) with heterogeneous graph neural networks and has become the state-of-the-art (SOTA) among parsers that use no extra silver training data, graph re-categorization, or ensemble methods.
However, these AMR parsers still suffer performance degradation when encountering long sentences with deeper AMR graphs [18,20] that introduce most LDD cases. We argue that the complexity of the clausal structure inside a sentence is the essence of LDDs, where clauses are the core units of grammar and center on a verb that determines the occurrences of other constituents [21]. Our intuition is that non-verb words in a clause typically cannot depend on words outside it, while dependencies between verbs correspond to the inter-clause relations, resulting in LDDs across clauses [22].
To support our claim, we demonstrate the AMR graph of a sentence from the AMR 2.0 dataset (https://catalog.ldc.upenn.edu/LDC2017T10, accessed on 11 June 2022) and distinguish the AMR relation distances in terms of different segment levels (clause/phrase/token) in Figure 1. Every AMR relation is represented as a dependent edge between two abstract AMR nodes that align to one or more input tokens. The dependency distances of inter-token relations are the subtraction of the indices of the tokens aligned to the source and target nodes, while those of inter-phrase and inter-clause relations are calculated with the indices of the headwords in phrases and the verbs in clauses, respectively. (The AMR relation distances between a main clause and a relative/appositive clause are decided by the modified noun phrase in the former and the verb in the latter.) As can be observed:
• Dependency distances of inter-clause relations are typically much longer than those of inter-phrase and inter-token relations, leading to most of the LDD cases. For example, the AMR relation anxious −:ARG0-of→ go-01, occurring between the clause "I get very anxious" and its relative clause "which does sort of go away . . .", has a dependency distance of 6 (subtracting the index of the 9th token "anxious" from that of the 15th token "go").

• Reentrant AMR nodes abstracted from pronouns also lead to far-distant AMR relations. For example, the AMR relation wait-01 → I has a dependency distance of 33 (subtracting the index of the 1st token "I" from that of the 34th token "wait").

Figure 1. Sentence: "If I do not check, I get very anxious, which does sort of go away after 15-30 mins, but often the anxiety is so much that I can not wait that long." The input sentence is placed at the (bottom), and the sentence's clause/phrase/token-level segments are positioned in the (middle) along with the token indices. The corresponding AMR graph is displayed at the (top), where AMR relations are represented as directed edges with a dependency distance, i.e., the subtraction of the indices of the two tokens mapping to the source/target AMR nodes. Inter-clause/phrase/token relations are distinguished in separate colors, corresponding to the segment levels' colors. Note that two virtual AMR nodes in dashed boxes of the reentrant node "I" are added for simplicity.
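In code, this distance computation amounts to subtracting aligned token indices; the minimal sketch below uses illustrative node names and the 1-based token indices from the example sentence:

```python
def relation_distance(aligned_index, source, target):
    """Dependency distance of an AMR relation: the absolute difference
    between the indices of the tokens aligned to its source/target nodes."""
    return abs(aligned_index[source] - aligned_index[target])

# 1-based token indices from the example sentence (illustrative alignments):
# "anxious" is the 9th token, "go" the 15th, "I" the 1st, "wait" the 34th.
aligned_index = {"anxious": 9, "go-01": 15, "I": 1, "wait-01": 34}

print(relation_distance(aligned_index, "anxious", "go-01"))  # inter-clause: 6
print(relation_distance(aligned_index, "wait-01", "I"))      # reentrancy: 33
```

For inter-phrase and inter-clause relations, the index passed in would be that of the phrase's headword or the clause's verb, as described above.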
Based on the findings above, we are inspired to utilize the clausal features of a sentence to cure LDDs. Rhetorical structure theory (RST) [23] provides a general way to describe the coherence relations among clauses and some phrases, i.e., elementary discourse units, and postulates a hierarchical discourse structure called a discourse tree. Besides RST, a novel clausal feature, hierarchical clause annotation (HCA) [24], also captures a tree structure of a complex sentence, where the leaves are segmented clauses and the edges are the inter-clause relations.
Due to the better parsing performances of the clausal structure [24], we select and integrate the HCA trees of complex sentences to cure LDDs in AMR parsing. Specifically, we propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate the HCA trees as clausal features in the popular AMR parsing codebase, SPRING [15]. In HCA-SA, we convert an HCA tree into a clause adjacency matrix and a token visibility matrix to restrict the attention scores between tokens from unrelated clauses and increase those between tokens from related clauses in masked-self-attention encoder layers. In HCA-CL, we employ curriculum learning with two training curricula, Clause-Number and Tree-Depth, with the assumption that "the more clauses or the deeper the clausal tree in a sentence, the more difficult it is to learn".
We conduct extensive experiments on two in-distribution (ID) AMR datasets (i.e., AMR 2.0 and AMR 3.0 (https://catalog.ldc.upenn.edu/LDC2020T02, accessed on 11 June 2022)) and three out-of-distribution (OOD) ones (i.e., TLP, New3, and Bio) to evaluate our two HCA-based approaches. In ID datasets, our parser achieves a 0.7 Smatch F1 score improvement against the baseline model, SPRING, on both AMR 2.0 and AMR 3.0, and outperforms the SOTA parser, HGAN, by 0.5 and 0.6 F1 scores for the fine-grained metric SRL in the two datasets. Notably, as the clause number of the sentence increases, our parser outperforms SPRING by a large margin and achieves better Smatch F1 scores than HGAN, indicating the ability to cure LDDs. In OOD datasets, the performance boosts achieved by our HCA-based approaches are more evident in complicated corpora like New3 and Bio, where sentences consist of more clauses and longer clauses. Our code is publicly available at https://github.com/MetroVancloud/HCA-AMRparsing (accessed on 3 August 2023).
The rest of this paper is organized as follows. The related works are summarized in Section 2, and the proposed approaches are detailed in Section 3. Then, the experiments of AMR parsing are presented in Section 4. Next, the discussion of the experimental results is presented in Section 5. Finally, our work is concluded in Section 6.

Related Work
In this section, we first introduce the open problem of LDDs and some universal methods. Then, we summarize the four main categories of parsers and the LDD cases in AMR parsing. Finally, we introduce the novel clausal feature, HCA.

Long-Distance Dependencies
LDDs, first proposed by Hockett [25], describe an interaction between two (or more) elements in a sequence separated by an arbitrary number of positions. LDDs are related to the rate of decay of statistical dependence of two points with increasing time intervals or spatial distances between them [26]. In recent linguistic research into LDDs, Liu et al. [22] propose the verb-frame frequency account to robustly predict acceptability ratings in sentences with LDDs, indicating the affinities between the number of verbs and LDDs in sentences.
In recent advances in NLP tasks, hierarchical recurrent neural networks [7], LSTM [8], the attention mechanism [9], Transformer [10], and implicit graph neural networks [11] have been proposed to cure LDDs. Specifically, the attention mechanism has been showcased with successful applications addressing LDDs in diverse environments. For instance, Xiong and Li [27] designed an attention mechanism to enable the neural model to learn relevant and informative words that contain topic-related information in students' constructed response answers. It solved long-distance dependencies by focusing on the parts of the input sequence that are most relevant to the task at hand. Zukov-Gregoric et al. [28] proposed a self-attention mechanism in the multi-head encoder-decoder neural network architecture that allows the network to focus on important resolution information in long writings, enabling it to perform better in named entity recognition. Li et al. [29] also employed a bidirectional LSTM model with a self-attention mechanism to enhance the sentiment information derived from existing linguistic knowledge and sentiment resources in the sentiment analysis task.
In most NLP tasks, these universal neural models all represent the input sequence with token-level embeddings from pretrained language models.

AMR Parsing
AMR parsing is a challenging semantic parsing task since AMR is a deep semantic representation consisting of many special annotations (e.g., abstract concept nodes, named entities, co-references, and such) [12]. The aim of AMR parsing is to translate a sentence into a directed acyclic semantic graph with relations among abstract concepts, and its two main characteristics are:

1. Abstraction: Assigns the same AMR to sentences with the same basic meaning and also brings a challenge for alignments between input tokens and output AMR nodes [12], e.g., the token "can" and its corresponding AMR node possible-01 in Figure 1.
2. Reentrancy: Introduces the presence of nodes with multiple parents and represents sentences as graphs rather than trees [30], causing some LDD cases, e.g., the AMR node "I" in Figure 1.
Existing AMR parsers can be summarized into four categories; among them, the seq2seq-based category is the most relevant to this work:
Seq2seq-based: Model the task as transduction of the sentence into a linearization of the AMR graph [15,17-20,35-38]. SPRING [15] is a popular seq2seq-based codebase that employs transfer learning by exploiting a pretrained encoder-decoder model, BART [16], to generate a linearized graph incrementally with a single auto-regressive pass of a seq2seq decoder. The subsequent models ANCES [17], HCL [18], ATP [19], and HGAN [20] all follow the architecture of SPRING, and HGAN integrates SDP and SRL features with heterogeneous graph neural networks to achieve the best performance among parsers that use no extra silver training data, graph re-categorization, or ensemble methods.
Despite different architectures, most AMR parsers employ word embeddings from pretrained language models and utilize token-level features like part-of-speech (POS), SDP, and SRL. However, these parsers still suffer performance degradation [18,20] when parsing complex sentences with LDDs due to the difficulties of aligning input tokens with output AMR nodes in such a long sequence.

Hierarchical Clause Annotation
RST [23] provides a general way to describe the coherence relations among parts of a text and postulates a hierarchical discourse structure called a discourse tree. The leaves of a discourse tree can be a clause or a phrase without strict definitions, known as elementary discourse units. However, the performances of RST parsing tasks are unsatisfactory due to the loose definitions of elementary discourse units and the abundant types of discourse relations [51,52].
Syntactic dependency parse trees (SDPTs) and constituency parse trees (CPTs) are existing syntactic representations that provide hierarchical token-level annotations. However, AMR has the unique feature of abstraction, summarized in Section 2.2, indicating the challenge of alignments between input tokens and output AMR nodes. When encountering long and complex sentences, the performances of SDPT [1] and CPT [2] parsing degrade, and these silver token-level annotations also contribute more noise to "token-node" alignments in AMR parsing.
In addition to RST, Fan et al. [24] also propose a novel clausal feature, HCA, which represents a complex sentence as a tree consisting of clause nodes and inter-clause relation edges. The HCA framework is based on English grammar [21], where clauses are elementary grammar units that center around a verb, and inter-clause relations can be classified into two categories:
1. Coordination: An equal relation shared by clauses with the same syntactic status, including And, Or, and But relations.
2. Subordination: Occurs between a matrix clause and a subordinate clause, including Subjective, Objective, Predicative, Appositive, Relative, and nine sublevel Adverbial relations.
Inter-clause relations have appropriate alignments with AMR relations, where nominal clause relations correspond to the frame arguments (e.g., Subjective vs. :ARG0), and adverbial clause relations are mapped to general semantic relations (e.g., Adverbial_of_Condition vs. :condition). Figure 2 demonstrates the segmented clauses and the HCA tree of the same sentence as in Figure 1. Based on well-defined clauses and inter-clause relations, Fan et al. provide a manually annotated HCA corpus for AMR 2.0 and high-performance neural models to generate silver HCA trees for more complex sentences. Therefore, we select and utilize the HCA trees as clausal features to address LDDs in AMR parsing.
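A concrete (hypothetical) way to hold an HCA tree in code is as clause leaves under labeled relation nodes. The sketch below only loosely follows the running example; the clause segmentation and relation labels are illustrative assumptions, not the gold annotation:

```python
from dataclasses import dataclass, field

@dataclass
class HCANode:
    """A node of an HCA tree: a leaf holds a segmented clause,
    an internal node holds an inter-clause relation over its children."""
    label: str                      # clause text (leaf) or relation (internal)
    children: list = field(default_factory=list)

    def clauses(self):
        """Collect the segmented clauses (leaves) from left to right."""
        if not self.children:
            return [self.label]
        return [c for child in self.children for c in child.clauses()]

    def depth(self):
        """Depth of the tree; a single clause has depth 1."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)

# Illustrative clausal structure of the example sentence (relations simplified):
tree = HCANode("But", [
    HCANode("Adverbial_of_Condition", [
        HCANode("If I do not check ,"),
        HCANode("Relative", [
            HCANode("I get very anxious ,"),
            HCANode("which does sort of go away after 15-30 mins ,"),
        ]),
    ]),
    HCANode("Objective", [
        HCANode("but often the anxiety is so much"),
        HCANode("that I can not wait that long ."),
    ]),
])
print(len(tree.clauses()), tree.depth())
```

The clause number and tree depth exposed here are exactly the two difficulty measures used later by the HCA-CL curricula.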

Model
In this paper, we propose two HCA-based approaches, HCA-SA (Section 3.1) and HCA-CL (Section 3.2), to integrate HCA trees in the popular AMR parser codebase SPRING for addressing LDDs in AMR parsing.

HCA-Based Self-Attention
Existing AMR parsers (e.g., SPRING) employ Transformer [10] as an encoder to obtain the input sentence representation. However, in the standard self-attention mechanism adopted by the Transformer model, every token needs to attend to all other tokens, and the learned attention matrix A is often very sparse across most data points [53]. Inspired by the work of Liu et al. [54], we propose the HCA-SA approach, which utilizes hierarchical clause annotations as a structural bias to restrict the attention scope and attention scores. We summarize this method in Algorithm 1.

Algorithm 1 HCA-Based Self-Attention
Require: Attention head number h; attention matrices Q, K, and V; token visibility matrices M_vis and M_deg; matrix M_rel that maps the attention head indices to clause relations
Ensure: Multi-head attention weights A_MultiHead with embedded HCA features
1: Initialization: multi-head attention weights A_MultiHead, attention layer index i ← 0
2: repeat
3:    mask the attention scores with HCA
. . .

Token Visibility Matrix
For the example sentence in Figure 2, its HCA tree can be transformed into a clause adjacency matrix in Figure 3a by checking whether two clauses share an inter-clause edge (which does not mean that they are adjacent in the source sentence); adjacent clauses have specific correlations (pink), and non-adjacent ones share no semantics (white). Each clause has the strongest correlation with itself (red).
Furthermore, we transform the clause adjacency matrix into token visibility matrices by splitting every clause into tokens. As shown in Figure 3b,c, the visibility between tokens can be summarized into the following cases:
• Global Visibility: tokens with a pronoun POS (e.g., "I" in C5) are globally visible due to the linguistic phenomenon of co-reference;
• Additional Visibility: tokens that are clausal keywords (i.e., coordinators, subordinators, and antecedents) share additional visibility with the tokens in adjacent clauses (e.g., "if" in C1 to tokens in C2).
Therefore, we introduce two token visibility matrices, M_vis and M_deg, where the former signals whether two tokens are mutually visible and the latter measures the visibility degree between them. The three cases distinguish whether the corresponding two tokens, w_i and w_j, are positioned in two nonadjacent clauses (No Visibility), in the same clause (Full Visibility), or in two adjacent clauses (Partial Visibility). Key(w_i/w_j) indicates that at least one of w_i and w_j is a clausal keyword token, while PRN(w_i/w_j) denotes the existence of at least one pronoun among w_i and w_j. Values of the hyperparameters λ0, λ1, and µ are in (0, 1), and λ1 > λ0.
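The construction of M_vis and M_deg can be sketched as follows; the function name, the boolean inputs, and the exact placement of the λ0/λ1/µ values among the visibility cases are our assumptions based on the description above:

```python
import numpy as np

NEG_INF = -1e9  # stands in for -infinity in the additive attention mask

def visibility_matrices(clause_of, adjacent, keyword, pronoun,
                        lam0=0.3, lam1=0.6, mu=0.9):
    """Build token visibility matrices from an HCA tree (a sketch).

    clause_of[i]    -- clause index of token i
    adjacent[a][b]  -- True iff clauses a and b share an inter-clause edge
    keyword[i]      -- True iff token i is a clausal keyword
    pronoun[i]      -- True iff token i has a pronoun POS
    Returns (M_vis, M_deg): additive mask (0 or -inf) and visibility degrees.
    """
    n = len(clause_of)
    M_vis = np.full((n, n), NEG_INF)
    M_deg = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            same = clause_of[i] == clause_of[j]
            adj = adjacent[clause_of[i]][clause_of[j]]
            if same:                          # Full Visibility
                M_vis[i, j], M_deg[i, j] = 0.0, 1.0
            elif adj:                         # Partial Visibility
                M_vis[i, j] = 0.0
                M_deg[i, j] = lam1 if (keyword[i] or keyword[j]) else lam0
            if not same and (pronoun[i] or pronoun[j]):
                # Global Visibility for co-reference
                M_vis[i, j] = 0.0
                M_deg[i, j] = max(M_deg[i, j], mu)
    return M_vis, M_deg
```

Tokens in nonadjacent clauses keep the -inf mask and zero degree unless a pronoun grants global visibility.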
Figure 3. Overview of our hierarchical clause annotation (HCA)-based self-attention approach that integrates the clausal structure of input sentences. In (a), red blocks mean that clauses have the strongest correlation with themselves; the pink/white ones mean that the corresponding two clauses are adjacent/non-adjacent in the HCA tree. In (b,c), the adjacency between two clauses is concretized in a token visibility matrix. Pink circles with a red dotted border mean one of the two corresponding tokens is a pronoun, while those with a blue solid border indicate the existence of a clausal keyword (i.e., coordinator, subordinator, or antecedent).

Masked-Self-Attention
To some degree, the token visibility matrices M_vis and M_deg contain the structural information of the HCA tree. For vanilla Transformers employed in existing AMR parsers, the stacked self-attention layers inside cannot receive M_vis and M_deg as inputs, so we modify them to Masked-Self-Attention, which can restrict the attention scope and attention scores according to M_vis and M_deg. Formally, the masked attention scores S_mask and the masked attention matrix A_mask are defined as:

S_mask = softmax(QK^T/√d + M_vis) ⊙ M_deg,
A_mask = S_mask V,

where the self-attention inputs are Q, K, V ∈ R^(N×d); N is the length of the input sentence; and the scaling factor d is the dimension of the model. Intuitively, if token w_i is invisible to w_j, the attention score S_mask(i,j) will be masked to 0 due to the value −∞ of M_vis(i,j) and the value 0 of M_deg(i,j). Otherwise, S_mask(i,j) will be scaled according to M_deg(i,j) in the different cases.
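A minimal sketch of this masked-self-attention step follows; the exact composition of the additive mask, softmax, and degree scaling is our reconstruction of the behavior described above:

```python
import numpy as np

def masked_self_attention(Q, K, V, M_vis, M_deg):
    """Masked self-attention (sketch of HCA-SA): invisible token pairs get
    zero attention weight; visible pairs are scaled by their visibility degree."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M_vis            # additive -inf-style mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    S_mask = e / e.sum(axis=-1, keepdims=True) * M_deg
    return S_mask @ V                                # A_mask = S_mask V
```

With full visibility (M_vis = 0, M_deg = 1 everywhere) this reduces to standard scaled dot-product attention.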

Clause-Relation-Bound Attention Head
In every stacked self-attention layer of the Transformer, multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions [10]. It also provides us the possibility of integrating different inter-clause relations in different attention heads. Instead of masking the attention matrix with non-labeled HCA trees, we propose a clause-relation-bound attention head setting, where every head attends to a specific inter-clause relation.
In this setting, we increase the visibility between tokens in two adjacent clauses with inter-clause relation rel_i in the attention matrix A_mask^(rel_i) of the bound head, i.e., increasing λ0 to 1 in Equation (2). Therefore, the final attention matrix A_MultiHead of each stacked self-attention layer is defined as:

A_MultiHead = Concat(A_mask^(rel_1), . . . , A_mask^(rel_h)) W^O,

where the parameter matrix W^O ∈ R^(hN×d), and h is the attention head number, mapping to the 16 inter-clause relations in HCA [24].
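The clause-relation-bound heads can be sketched as below. All names are assumptions, and for simplicity we use the standard (h·d)×d output projection of multi-head attention rather than the hN×d matrix stated above; each head promotes its bound relation's token pairs to full visibility degree:

```python
import numpy as np

def _attend(Q, K, V, M_vis, M_deg):
    # masked self-attention of HCA-SA (sketch): mask, softmax, degree-scale
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d) + M_vis
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True) * M_deg) @ V

def relation_bound_multihead(Q, K, V, M_vis, M_deg, rel_masks, W_O):
    """Each attention head is bound to one inter-clause relation: token pairs
    joined by that relation get full visibility degree (lambda_0 raised to 1).
    rel_masks[i] is a boolean (n x n) mask of token pairs under relation i."""
    heads = [_attend(Q, K, V, M_vis, np.where(m, 1.0, M_deg)) for m in rel_masks]
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(heads) W_O
```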

HCA-Based Curriculum Learning
Inspired by the idea of curriculum learning [55], which suggests that humans handle difficult tasks better when trained from easy examples to hard ones, we propose an HCA-CL approach for training an AMR parser, in which the clause number and the tree depth of a sentence's HCA are the measurements of learning difficulty. Referring to the previous work of Wang et al. [18], we set two learning curricula, Clause-Number and Tree-Depth, in our HCA-CL approach, as demonstrated in Figure 4.

Figure 4. The Clause-Number and Tree-Depth curricula of our HCA-CL approach, scheduled over training episodes.

Clause-Number Curriculum
In the Clause-Number (CN) curriculum, sentences with more clauses, which involve more inter-clause relations and longer dependency distances (demonstrated in Figure 1), are considered harder to learn. Given this assumption, all training sentences are divided into N buckets {C_i : i = 1, . . . , N} according to their clause number, where C_i contains the sentences with clause number i. In each training epoch of the Clause-Number curriculum, there are N training episodes with T_cn steps each. At each step of the i-th episode, the training scheduler samples a batch of examples from the buckets {C_j : j ≤ i} to train the model.
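A minimal sketch of this Clause-Number scheduling follows; the bucket construction and batch-sampling details are our assumptions:

```python
import random

def clause_number_curriculum(sentences, clause_count, steps, batch_size, seed=0):
    """Clause-Number curriculum scheduler (sketch): in the episode for clause
    number i, batches are sampled only from sentences with at most i clauses."""
    rng = random.Random(seed)
    buckets = {}
    for s in sentences:
        buckets.setdefault(clause_count(s), []).append(s)
    for i in sorted(buckets):                 # one episode per bucket, easy to hard
        pool = [s for j in sorted(buckets) if j <= i for s in buckets[j]]
        for _ in range(steps):                # T_cn steps per episode
            yield i, rng.sample(pool, min(batch_size, len(pool)))
```

The Tree-Depth curriculum below follows the same scheme with the HCA tree depth as the difficulty measure.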

Tree-Depth Curriculum
In the Tree-Depth (TD) curriculum, sentences with deeper HCA trees, which correspond to deeper hierarchical AMR graphs, are considered harder to learn. Given this assumption, all training sentences are divided into M buckets {D_i : i = 1, . . . , M} according to their HCA tree depth, where D_i contains the sentences with tree depth i.

Experiments
In this section, we describe the details of datasets, environments, model hyperparameters, evaluation metrics, compared models, and parsing results for the experiments.

Datasets
For the benchmark datasets, we choose two standard AMR datasets, AMR 2.0 and AMR 3.0, as the ID settings and three test sets, TLP, New3, and Bio, as the OOD settings.
For the HCA tree of each sentence, we use the manually annotated HCA trees for AMR 2.0 provided by [24] and auto-annotated HCA trees for the remaining datasets, which were all generated by the HCA Segmenter and the HCA Parser proposed by [24]. The training, development, and test sets in both datasets are random splits, and therefore we take them as ID datasets as in previous works [15,17,18,20,45].

Out-of-Distribution Datasets
To further estimate the effects of our HCA-based approaches on open-world data that come from a different distribution, we follow the OOD settings introduced by [15] and predict based on three OOD test sets with the parser trained on the AMR 2.0 training set:

Hierarchical Clause Annotations
For the hierarchical clausal features utilized in our HCA-based approaches, we use the manually annotated HCA corpus for AMR 2.0 provided in [24]. Moreover, we employ the HCA segmenter and the HCA parser proposed in [24] to generate silver HCA trees for AMR 3.0 and the three OOD test sets. Detailed statistics of the evaluation datasets in this paper are listed in Table 1.
Table 1. Main statistics of five AMR parsing benchmarks. "ID" and "OOD" denote in-distribution and out-of-distribution settings, respectively. "#Snt." and "#HCA" represent the total numbers of sentences and of complex sentences with hierarchical clause annotations in each split set.

Since SPRING provides a clear and efficient seq2seq-based architecture based on a vanilla BART, the recent seq2seq-based models HCL, ANCES, and HGAN all select it as the codebase. Therefore, we also choose SPRING as the baseline model to apply our HCA-based approaches. Additionally, we do not take the competitive AMR parser ATP [19] into consideration among our compared models since it employs syntactic dependency parsing and semantic role labeling as intermediate tasks to introduce extra silver training data.

Hyper-Parameters
For the hyper-parameters of our HCA-based approaches, we list their layer, name, and value in Table 2. To pick the hyper-parameters employed in the HCA-SA encoder, i.e., λ0, λ1, and µ, we use a random search with a total of 16 trials in their search spaces (λ0: [0.1, 0.6], λ1: [0.4, 0.9], µ: [0.7, 1.0], with stride 0.1). According to the results of these experimental trials, we selected the final pick for each hyper-parameter. All models are trained until reaching their maximum epochs, and then we select the best model checkpoint on the development set.
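The random search over λ0, λ1, and µ can be sketched as follows; sampling from 0.1-strided grids and enforcing the λ1 > λ0 constraint during sampling are our assumptions:

```python
import random

def random_search(trials=16, seed=0):
    """Random search (sketch) over the HCA-SA hyper-parameters on 0.1-strided
    grids: lambda_0 in [0.1, 0.6], lambda_1 in [0.4, 0.9], mu in [0.7, 1.0],
    keeping only picks with lambda_1 > lambda_0."""
    rng = random.Random(seed)

    def grid(lo, hi, stride=0.1):
        n = int(round((hi - lo) / stride))
        return [round(lo + k * stride, 1) for k in range(n + 1)]

    space = {"lam0": grid(0.1, 0.6), "lam1": grid(0.4, 0.9), "mu": grid(0.7, 1.0)}
    picks = []
    while len(picks) < trials:
        pick = {name: rng.choice(values) for name, values in space.items()}
        if pick["lam1"] > pick["lam0"]:       # constraint: lambda_1 > lambda_0
            picks.append(pick)
    return picks
```

Each returned pick would then be evaluated on the development set to select the final hyper-parameters.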

Evaluation Metrics
Following previous AMR parsing works, we use Smatch scores [57] and fine-grained metrics [58] to evaluate the performances. Among the fine-grained AMR metrics, Unlabeled (Unlab.) is Smatch computed on the predicted graphs after removing all edge labels. We classify Unlab., Reent., and SRL as structure-dependent metrics:
• Unlab. does not consider any edge labels and only considers the graph structure;
• Reent. is a typical structural feature of the AMR graph; without reentrant edges, the AMR graph is reduced to a tree;
• SRL denotes the core semantic relations of the AMR, which determine the core structure of the AMR.
Conversely, all other metrics are classified as structure-independent metrics.

Experimental Environments
Table 3 lists the information on the main hardware and software used in our experimental environments. Note that the model on AMR 2.0 is trained for a total of 30 epochs in 16 h, while the model on AMR 3.0 finishes a total of 30 epochs in 28 h given the experimental environments.

Experimental Results
We now report the AMR parsing performances of our HCA-based parser and other comparison parsers on ID datasets and OOD datasets, respectively.

Results in ID Datasets
As demonstrated in Table 4, we report the AMR parsing performances of the baseline model (SPRING), other compared parsers, and the modified SPRING that applies our HCA-based self-attention (HCA-SA) and curriculum learning (HCA-CL) approaches on the ID datasets AMR 2.0 and AMR 3.0. All the results of our HCA-based model are averaged scores over five experimental trials, and we compute the significance of performance differences using the non-parametric approximate randomization test [59]. From the results, we can make the following observations:
• Equipped with our HCA-SA and HCA-CL approaches, the baseline model SPRING achieves a 0.7 Smatch F1 score improvement on both AMR 2.0 and AMR 3.0. The improvements are significant, with p < 0.005 and p < 0.001, respectively.
• In AMR 2.0, our HCA-based model outperforms all compared models except ANCES and the HGAN version that introduces both DP and SRL features.
• In AMR 3.0, which consists of more sentences with HCA trees, the performance gap between our HCA-based parser and the SOTA (HGAN with DP and SRL) is only a 0.2 Smatch F1 score.
To better analyze how the performance improvements of the baseline model are achieved when applying our HCA-based approaches, we also report structure-dependent fine-grained results in Table 4. As claimed in Section 1, inter-clause relations in the HCA can bring LDD issues, which are typically related to AMR concept nodes aligned with verb phrases and reflected in structure-dependent metrics. As can be observed:
• Our HCA-based model outperforms the baseline model in nearly all fine-grained metrics, especially in structure-dependent metrics, with 1.1, 1.8, and 3.9 F1 score improvements in Unlab., Reent., and SRL, respectively.
• In the SRL, Conc., and Neg. metrics, our HCA-based model achieves the best performance against all compared models.

Results in OOD Datasets
As demonstrated in Table 5, we report the parsing performances of our HCA-based model and compared models on the three OOD datasets. As can be seen:
• Our HCA-based model outperforms the baseline model SPRING with 2.5, 0.7, and 3.1 Smatch F1 score improvements on the New3, TLP, and Bio test sets, respectively.
• In the New3 and Bio datasets, which contain long sentences of newswire and biomedical texts and have more HCA trees, our HCA-based model outperforms all compared models.
• In the TLP dataset, which contains many simple sentences from a children's story and fewer HCA trees, our HCA-based model does not perform as well as HCL and HGAN.

Discussion
As shown in the previous section, our HCA-based model achieves prominent improvements against the baseline model, SPRING, and outperforms other compared models, including the SOTA model HGAN, in some fine-grained metrics on ID and OOD datasets. In this section, we further discuss the paper's main issue of whether our HCA-based approaches have any effect on curing LDDs. Additionally, ablation studies and case studies are also provided.

Effects on Long-Distance Dependencies in ID Datasets
As claimed in Section 1, most LDD cases occur in sentences with complex hierarchical clause structures. In Figure 5, we demonstrate the parsing performance trends of the baseline model SPRING, the SOTA parser HGAN (we only use the original data published in their paper to draw performance trends over the number of tokens, without performances in terms of the number of clauses), and our HCA-based model over the number of tokens and clauses in sentences from AMR 2.0. It can be observed that:
• When the number of tokens (denoted as #Token for simplicity) > 20 in a sentence, the performance boost of our HCA-based model against the baseline SPRING gradually becomes significant.
• For the case #Token > 50, which indicates sentences with many clauses and inter-clause relations, our HCA-based model outperforms both SPRING and HGAN.
• When compared with the performance trends over #Clause, the performance lead of our HCA-based model against SPRING becomes much more evident as #Clause increases.

To summarize, our HCA-based approaches show significant effectiveness on long sentences with complex clausal structures that introduce most LDD cases.

Effects on Long-Distance Dependencies in OOD Datasets
As the performance improvements achieved by our HCA-based approaches are much more prominent in OOD datasets than in ID datasets, we further explore the OOD datasets with different characteristics.
Figure 6 demonstrates two main statistics of the three OOD datasets, i.e., the average number of clauses per sentence (denoted as #C/S) and the average number of tokens per clause (#T/C). These two statistics both characterize the complexity of the clausal structure inside a sentence:
• #C/S shows the number of complex sentences with more than one clause;
• #T/C depicts the latent dependency distance between two tokens from different clauses.
We also present the performance boosts of our HCA-based parser against SPRING in Figure 6. As can be observed, the higher the values of #C/S and #T/C in an OOD dataset, the higher the Smatch improvements achieved by our HCA-based approaches. Specifically, New3 and Bio cover more complex texts from newswire and biomedical articles, while TLP contains simpler sentences that are easy for children to read. Therefore, our AMR parser performs much better on the complex sentences from Bio and New3, indicating the effectiveness of our HCA-based approaches on LDDs.
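The two statistics can be computed from a clause-segmented corpus as follows; this is a minimal sketch with a toy corpus, not the actual dataset representation:

```python
def corpus_statistics(corpus):
    """Compute #C/S (average clauses per sentence) and #T/C (average tokens
    per clause) for a corpus of sentences, each a list of token-lists."""
    n_sent = len(corpus)
    n_clause = sum(len(sentence) for sentence in corpus)
    n_token = sum(len(clause) for sentence in corpus for clause in sentence)
    return n_clause / n_sent, n_token / n_clause

# Toy corpus: two sentences, three clauses, twelve tokens in total.
toy = [[["If", "I", "do", "not", "check"], ["I", "get", "very", "anxious"]],
       [["That", "is", "fine"]]]
c_per_s, t_per_c = corpus_statistics(toy)
```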

Ablation Study
In the HCA-SA approach, two token visibility matrices derived from HCA trees are introduced to mask certain attention heads. Additionally, we propose a clause-relation-bound attention head setting to integrate inter-clause relations in the encoder. Therefore, we conduct ablation studies by introducing random token visibility matrices (denoted as "w/o VisMask") and removing the clause-relation-bound attention setting (denoted as "w/o ClauRel"). Note that "w/o VisMask" contains the case of "w/o ClauRel" because the clause-relation-bound attention setting is based on the masked-self-attention mechanism. In the HCA-CL approach, extra training epochs for the Clause-Number and Tree-Depth curricula serve as a warm-up stage for the subsequent training process. To eliminate the effect of the extra epochs, we also add the same number of training epochs in the ablation study of our HCA-CL approach.

As shown in Table 6:
• In HCA-SA, the clause-relation-bound attention setting (denoted as "ClauRel") contributes most to the SRL metric, owing to the mappings between inter-clause relations (e.g., Subjective and Objective) and SRL-type AMR relations (e.g., :ARG0 and :ARG1).
• In HCA-SA, the masked-self-attention mechanism (denoted as "VisMask") achieves significant improvements in the Reent. metric by increasing the visibility of pronoun tokens to all tokens.
• In HCA-CL, the Tree-Depth curriculum (denoted as "TD") has no effect on parsing performance. We conjecture that sentences with much deeper clausal structures are rare, and that the number of buckets splitting the depth of clausal trees is not large enough to distinguish the training sentences.

Case Study
To further demonstrate the effectiveness of our HCA-based approaches on LDDs in AMR parsing, we compare the output AMR graphs of the same example sentence in Figure 1 parsed by the baseline model SPRING and by the modified SPRING that applies our HCA-SA and HCA-CL approaches (denoted as Ours), respectively, in Figure 7.
SPRING mislabels node "go-02" in subgraph G3 as the :ARG1 role of node "contrast-01". It then fails to realize that it is "anxious" in G2 that takes the :ARG1 role of "go-02" in G3. Additionally, the causality between G4 and G5 is not interpreted correctly due to the absence of node "cause-01" and its arguments.
In contrast, when integrating the HCA, Ours interprets the inter-clause relations better. Although "possible-01" in subgraph G5 is mislabeled as the :ARG2 role of node "contrast-01", Ours avoids the errors made by SPRING. Its other mistake is that the relation :quant between "much" and "anxiety" is reversed and replaced by :domain, which barely impacts the Smatch F1 score. The vast performance gap between SPRING and our HCA-based SPRING in Smatch F1 scores (66.8% vs. 88.7%) further proves the effectiveness of the HCA on LDDs in AMR parsing.

Table 6. F1 scores (%) of Smatch and three structure-dependent metrics achieved by our HCA-based models in ablation studies on AMR 2.0. "w/o" denotes "without". "VisMask" and "ClauRel" indicate the token visibility matrices and the clause-relation-bound attention head setting in the HCA-based self-attention (HCA-SA) approach. "CN" and "TD" represent the Clause-Number and Tree-Depth curricula in the HCA-based curriculum learning (HCA-CL) approach. Bold figures indicate the best performance, achieved by our model with full features. Figures in red represent the most significant performance degradation when removing a specific feature, while those in cyan denote the slightest degradation.

Conclusions
We propose two HCA-based approaches, HCA-SA and HCA-CL, to integrate the HCA trees of complex sentences for addressing LDDs in AMR parsing. We apply our HCA-based framework to a popular AMR parser, SPRING, to integrate HCA features in the encoder. In the evaluations on ID datasets, our parser achieves prominent and explainable improvements against the baseline model, SPRING, and outperforms the SOTA parser, HGAN, in some fine-grained metrics. Notably, as the number of clauses in a sentence increases, our parser outperforms SPRING by a large margin and achieves better Smatch F1 scores than HGAN, indicating its ability to alleviate LDDs. In the evaluations on OOD datasets, the performance boosts achieved by our HCA-based approaches are more evident on complicated corpora like New3 and Bio, whose sentences consist of more numerous and longer clauses.

Figure 1. AMR relation dependency distances in different segment levels of an AMR 2.0 sentence. The input sentence is placed at the (bottom), and the sentence's clause/phrase/token-level segments are positioned in the (middle) along with the token indices. The corresponding AMR graph is displayed at the (top), where AMR relations are represented as directed edges with a dependency distance, i.e., the difference between the indices of the two tokens mapping to the source/target AMR nodes. Inter-clause/phrase/token relations are distinguished in separate colors, corresponding to the colors of the segment levels. Note that two virtual AMR nodes in dashed boxes of the reentrant node "I" are added for simplicity.

• Full Visibility: tokens in the same clause are fully mutually visible;
• Partial Visibility: tokens from two adjacent clauses C1 and C2 share partial visibility;
• No Visibility: tokens from non-adjacent clauses C2 and C5 are invisible to each other.
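Assuming each token is labeled with the id of its clause and the clause adjacency of the HCA tree is known, the three visibility rules can be sketched as below. The `bridge_tokens` set (pronouns and clausal keywords allowed to see across adjacent clauses) is an illustrative simplification of the partial-visibility rule:

```python
import numpy as np

def visibility_matrix(clause_ids, adjacent_pairs, bridge_tokens):
    """Build a token visibility matrix from clause membership.
    clause_ids[i]  : id of the clause containing token i
    adjacent_pairs : clause-id pairs that are adjacent in the HCA tree
    bridge_tokens  : token indices (pronouns/clausal keywords) that remain
                     visible across adjacent clauses."""
    n = len(clause_ids)
    vis = np.zeros((n, n), dtype=int)
    adj = {frozenset(p) for p in adjacent_pairs}
    for i in range(n):
        for j in range(n):
            if clause_ids[i] == clause_ids[j]:
                vis[i, j] = 1                      # Full Visibility
            elif frozenset((clause_ids[i], clause_ids[j])) in adj:
                # Partial Visibility: only through bridge tokens
                vis[i, j] = int(i in bridge_tokens or j in bridge_tokens)
            # No Visibility: non-adjacent clause pairs stay 0
    return vis

# Tokens 0-1 in clause C1, 2-3 in C2, 4 in C5; C1-C2 adjacent, C2-C5 not.
vis = visibility_matrix([1, 1, 2, 2, 5], [(1, 2)], bridge_tokens={2})
```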
Figure 3. Overview of our hierarchical clause annotation (HCA)-based self-attention approach that integrates the clausal structure of input sentences. In (a), red blocks mean that clauses have the strongest correlation with themselves; the pink/white ones mean that the corresponding two clauses are adjacent/non-adjacent in the HCA tree. In (b,c), the adjacency between two clauses is concretized in a token visibility matrix. Pink circles with a red dotted border mean that one of the two corresponding tokens is a pronoun, while those with a blue solid border indicate the existence of a clausal keyword (i.e., a coordinator, subordinator, or antecedent).

Figure 4. Overview of our hierarchical clause annotation (HCA)-based curriculum learning approach with two curricula, Clause-Number and Tree-Depth. The learning difficulties of the two curricula are set by the clause number and the tree depth of a sentence's HCA in the (left) and (right) charts, respectively. Two example sentences from AMR 2.0 and their HCAs are demonstrated in the middle.
In the training epoch of the Clause-Number curriculum, there are M training episodes with T_td steps each. At each step of the i-th episode, the training scheduler samples a batch of examples from the buckets {I_j : j ≤ i} to train the model.
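A minimal sketch of this Clause-Number scheduling, where the bucket contents and hyper-parameter values are illustrative:

```python
import random

def clause_number_schedule(buckets, steps_per_episode, batch_size, seed=0):
    """Yield (episode, batch) pairs for the Clause-Number curriculum.
    buckets[i] holds sentences whose HCA has i+1 clauses (bucket I_{i+1});
    in episode i, batches are sampled only from buckets I_1 .. I_i, so
    simple sentences come first and harder ones gradually join the pool."""
    rng = random.Random(seed)
    for i in range(1, len(buckets) + 1):
        pool = [s for bucket in buckets[:i] for s in bucket]
        for _ in range(steps_per_episode):
            yield i, rng.sample(pool, min(batch_size, len(pool)))

buckets = [["s1", "s2", "s3"],   # 1-clause sentences
           ["m1", "m2"],         # 2-clause sentences
           ["h1"]]               # 3-clause sentences
batches = list(clause_number_schedule(buckets, steps_per_episode=2, batch_size=2))
```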

4.1.1. In-Distribution Datasets
We first train and evaluate our HCA-based parser on two standard AMR parsing evaluation benchmarks:
• AMR 2.0 includes 39,260 sentence-AMR pairs whose source sentences are collected from the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, text from the Wall Street Journal, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program.
• AMR 3.0 is a superset of AMR 2.0 and enriches the data to 59,255 instances. New source data added to AMR 3.0 include sentences from Aesop's Fables, parallel text and the situation frame dataset developed by LDC for the DARPA LORELEI program, and lead sentences from Wikipedia articles about named entities.

Figure 5. Performance trends of SPRING, HGAN, and Ours on the AMR 2.0 dataset over the number of tokens (denoted as "#Token") and clauses (denoted as "#Clause") inside a sentence. (a) Performance trends over the number of tokens. (b) Performance trends over the number of clauses.

Figure 6. Two important characteristics of the three out-of-distribution (OOD) test sets (i.e., TLP, New3, and Bio) and the performance boosts of our HCA-based parser on each test set. The blue and green statistics of each dataset represent the average number of clauses per sentence and of tokens per clause, respectively. The red statistics show the improvements of our HCA-based model against the baseline model, SPRING, on each OOD dataset.

Figure 7. Parsing results of the baseline model SPRING and the modified SPRING that applies our HCA-based approaches (denoted as Ours) when encountering the same AMR 2.0 sentence as in Section 1. AMR nodes and edges in red are parsing errors compared to the gold AMR graph. Extra nodes and edges, which are correctly parsed by both, are omitted.

Segmented clauses and the HCA tree of a sentence in AMR 2.0: "If I do not check, I get very anxious, which does sort of go away after 15-30 mins, but often the anxiety is so much that I can not wait that long." Clauses C2 and C4 are contrasted and coordinated, dominated by the node BUT. Clauses C1, C3, and C5 are subordinated to their matrix clauses, where cnd, rel, and res represent the inter-clause relations Adverbial_of_Condition, Relative, and Adverbial_of_Result, respectively.

Table 2. Final hyper-parameter configuration of the clause segmentation model. "HCA-SA Encoder" indicates the HCA-based self-attention approach used in the encoder, and "HCA-CL Strategy" represents the HCA-based curriculum learning approach used before the normal training epochs.

Table 3. Hardware and software used in our experiments.

Table 4. Smatch and fine-grained F1 scores (%) of our AMR parser and comparative ones on two in-distribution (ID) evaluation test sets. The column "Feat." lists the extra features that an AMR parser requires, where "DP", "SRL", and "HCA" indicate syntactic dependencies, semantic role labelings, and hierarchical clause annotations, respectively. For fairness of comparison, the reported performances of the compared parsers are the versions without graph re-categorization, extra silver training data, and ensemble methods. The best result per measure across each test set is shown in bold, while the best between the baseline model (SPRING) and ours is underlined. "w/o" denotes "without".

Table 5. Smatch F1 scores (%) of our HCA-based model and comparison models on out-of-distribution (OOD) datasets. The best result for each test set is shown in bold.