Hierarchical Clause Annotation: Building a Clause-Level Corpus for Semantic Parsing with Complex Sentences

Featured Application: Hierarchical clause annotation could be applied in many downstream tasks of natural language processing, including abstract meaning representation parsing, semantic dependency parsing, text

However, previous works with such intuitions to process complex sentences are not practical for semantic parsing tasks, such as AMR [11] and semantic dependency parsing [12]. RST parsing, which aims to extract the rhetorical relations among elementary discourse units (EDUs) [13] at the document level, is still an open problem: the state-of-the-art model only achieves 55.4 and 80.4 Parseval-Full scores for multi- and intra-sentential parsing, respectively. Besides, the blurry definitions of EDUs and the misalignments between rhetorical relations and semantic relations make RST parsing unsuitable for semantic parsing. The SPRP task maintains a coarse splitting granularity, where the outputs may still be complex sentences. The TS and SSD tasks, which decompose complex sentences into simple ones, cannot preserve the original semantics, because they rephrase for simpler syntax (TS) or drop discourse connectives (SSD).
To avoid the deficiencies of the previous works summarized above, we propose a novel task, hierarchical clause annotation (HCA), based on the linguistic research of clause hierarchy [14], where clauses are fundamental text units centering on a verb phrase and sentences with multiple clauses form a complex hierarchy. Our HCA is a more lightweight task at the sentence level, has explicit definitions of clauses and appropriate mappings between inter-clause and semantic relations (vs. RST parsing), and aims to annotate complex sentences into a clause hierarchy (vs. SPRP) without changing or dropping any semantics (vs. TS and SSD).
To show the potential of HCA to facilitate semantic parsing with complex sentences, we demonstrate the HCA tree, AMR graph, and semantic dependency graph (SDG) of a complex sentence from the AMR 2.0 dataset (https://catalog.ldc.upenn.edu/LDC2017T10, accessed on 15 June 2021) in Figure 1. In the HCA tree, the coordinate relation But and the clauses $C_i$ are nodes, and subordinate relations are directed edges. As demonstrated in Figure 1, the HCA tree shares the same hierarchy with the two semantic parsing representations, indicating the possibility of incorporating HCA's structural information into semantic parsing with complex sentences.
Inspired by the similarities among the HCA tree, AMR graph, and SDG, we annotated the first HCA corpus under the guidance of previous works on crowdsourcing annotation [15,16]. Furthermore, we adapted the state-of-the-art models [17,18] of the discourse segmentation and parsing tasks to train baseline models that generate HCA annotations automatically.
Our main contributions are as follows:
1. We propose a novel framework, hierarchical clause annotation (HCA), to segment complex sentences into clauses and capture their interrelations based on the linguistic research of clause hierarchy, aiming to provide clause-level structural features to facilitate semantic parsing tasks.
2. We elaborate on our experience developing a large HCA corpus, including determining an annotation framework, creating a silver-to-gold manual annotation tool, and ensuring annotation quality. The resulting HCA corpus contains 19,376 English sentences from AMR 2.0, each including at least two clauses.
3. We decomposed HCA into two subtasks, i.e., clause segmentation and parsing, and adapted discourse segmentation and parsing models for the HCA subtasks. The experimental results showed that the adapted models achieved satisfactory performance in providing reliable silver HCA data.
[Figure 1. The HCA tree, AMR graph, and semantic dependency graph of a complex sentence from AMR 2.0. Clause text recoverable from the figure: "I get very anxious", "which does sort of go away after 15-30 mins", "but often the anxiety is so much", "that I can not wait that long".]
The rest of this paper is organized as follows. First, the related works are summarized in Section 2, and the proposed HCA framework, along with the details of manually annotating a large HCA corpus from scratch, is described in Section 3. Then, the neural end-to-end models for the clause segmentation and clause parsing subtasks are proposed in Section 4. Next, the experimental details and results of evaluating the proposed models are presented in Section 5, and the potential of utilizing HCA features in AMR parsing and semantic dependency parsing is discussed in Section 6. Finally, our work is concluded in Section 7.

RST Parsing
In a document, the clauses, sentences, and paragraphs are logically connected to form a coherent discourse. RST [13] provides a general way to describe the relations among parts of a text and postulates a hierarchical discourse structure called the discourse tree (DT). The leaves of a DT, known as elementary discourse units (EDUs), can be clauses or phrases and have no strict definition. Adjacent EDUs and higher-order spans are non-overlapping and connected hierarchically through coherence relations. Thus, coherence in discourse can be analyzed in terms of how a nucleus interacts with its surrounding satellites to communicate the relationships between the main ideas and related ideas.
The RST-parsing task generally requires breaking the text into EDUs (i.e., the discourse segmentation task) and linking the EDUs into a DT (i.e., the discourse parsing task). For discourse segmentation, Gessler et al. [17] proposed a Transformer-based neural classifier that enhances contextualized word embeddings with hand-crafted features and achieved the current state-of-the-art performance in the DISRPT 2021 Shared Task on Discourse Unit Segmentation (https://sites.google.com/georgetown.edu/disrpt2021?pli=1, accessed on 1 April 2023). For discourse parsing, Kobayashi et al. [18] explored a strong baseline by integrating previous simple parsing strategies, top-down and bottom-up, with various Transformer-based pretrained language models (PLMs).
Table 1 compares our HCA with RST parsing using example sentences from the RST Discourse Tagging Reference Manual (https://www.isi.edu/~marcu/discourse/, accessed on 20 March 2023). Although both tasks aim to extract a tree structure from input texts, their definitions of elementary units and target interrelations vary, leading to the following differences:

• The elementary units of HCA are clauses, while those of RST parsing are EDUs, which include both clauses and phrases. The blurry definition of an EDU may cause obstacles in RST parsing. For example, the phrases "as a result of margin calls" in (1) and "Despite some their considerable incomes and assets" in (2) are segmented as EDUs but are not clauses, due to the absence of a verb. Moreover, although the clause "that they have made it" functions as a predicative of the verb "feel", it cannot be annotated as an EDU.
• The rhetorical relations in RST parsing characterize the coherence among EDUs, and some cannot be mapped to semantic relations. For example, the semantic relation between "feel" and "that they have made it" in (2) is not captured in RST parsing.
In summary, the blurry definitions of EDUs and the misalignments between rhetorical and semantic relations make RST parsing unsuitable for semantic parsing compared with our HCA.

Other Similar Tasks
Some similar tasks share the idea of decomposing complex sentences into simpler parts without capturing their interrelations.

Clause Identification
For the CoNLL-2001 shared task on clause identification, Reference [19] proposes a dataset with gold-standard clauses provided by the Penn Treebank II [15], where clause-level tags (i.e., S, SBAR, SBARQ, SINV, and SQ) indicate target clauses and clausal conjunctions. The clauses identified in the shared task comprise tensed clauses, non-tensed verb phrases, coordinators, and subordinators.

Split-and-Rephrase
The split-and-rephrase (SPRP) task [8] aims to split a complex input sentence into shorter sentences while preserving meaning. In that task, the emphasis is on sentence splitting and rephrasing. There is no deletion and no lexical or phrasal simplification, but the systems must learn to split complex sentences into shorter ones and to make the syntactic transformations required by the split (e.g., turn a relative clause into a main clause).

Text Simplification
The text simplification (TS) task [20] is the process of reducing the linguistic complexity of a text to improve its understandability and readability while maintaining its original information content and meaning.Typically, it rephrases complex sentences with simpler vocabulary and syntax and ignores trivial clauses from the source.

Simple-Sentence-Decomposition
The simple sentence decomposition (SSD) task [10] converts complex sentences into a covering set of simple sentences derived from the tensed clauses in the source sentence, where shared nouns or pronouns are copied and discourse connectives (e.g., and, but, although, etc.) are dropped.

Summary
Table 2 compares our HCA task and similar tasks above with the exemplified sentence in Section 1:

• For clause identification, the coordinator "but" and the subordinators "If", "which", and "that" are segmented out. Besides, non-tensed verb phrases that function as a subject, object, or postmodifier are also target clauses in the task. These cases fall outside the definition of annotated clauses in our HCA framework, as they introduce redundant hierarchies when capturing inter-clause relations.
• For split-and-rephrase, the granularity of decomposition is coarser than clauses, owing to the need to preserve the original meaning of the input sentence. Outputs (1) and (3) are still complex sentences with two clauses. Additionally, output (2) is segmented from the relative clause that modifies "anxious" in the matrix clause, leading to a syntactic transformation.
• For text simplification, dropping the subordinator "If" and the coordinator "but" makes the discourse relations between the output sentences uncertain. Moreover, as the replacements for simpler syntax in (3) and (4) bring misalignments between substitute and substituted tokens, text simplification is unsuitable as a preprocessing step for semantic dependency parsing, which is a token-level task.
• Simple sentence decomposition also drops clausal connectives, as text simplification does, which makes some of the discourse relations captured by semantic parsing uncertain.

Clause Hierarchy
Clause hierarchy can be described as a cline along which clauses distribute according to their different levels of grammatical integration [14,21-25]. These works propose that clause combinations in many languages can be described as a set of tighter or looser clauses. A tight clause means that a clausal constituent has, in comparison to a loose clause, more dependence on the clause with which it combines, typically a main clause. Table 3 shows three main versions of clause hierarchy and the linguistic phenomena ordered according to their degree of clause integration tightness.
Table 3. Three main types of clause hierarchy and the clines of their clause integration tightness degree.

Type | Cline of Clause Integration Tightness Degree
Matthiessen [22] | Embedded > Hypotaxis > Parataxis > Cohesive Devices > Coherence
Hopper and Traugott [23] | Subordination > Hypotaxis > Parataxis

Matthiessen [22] presented a type of clause hierarchy that extends from syntactic clause combination to cohesion and coherence at the discourse and text level. He considered clause combination to range from tight syntactic "embedding" (e.g., infinitival clauses as a complement to a main verb) to the looser relations of "hypotaxis" (e.g., a finite adverbial) to "parataxis" (e.g., coordination).
Hopper and Traugott [23] also offered a model that shares much with Matthiessen's, where "parataxis" is the syntactic independence of clauses and "hypotaxis" is a more integrated clause that is syntactically dependent within another clause's predicate.Tighter still is "subordination", which, like "embedding" in Matthiessen's type, covers all clauses that function as a constituent essential to grammaticality, e.g., verbal arguments.
Payne's clause hierarchy [14] extended from "compound verbs" (tightest) through to separate "sentences" (loosest), where clause combination becomes more or less a single verbal element in his type.Different from the above two types, Payne argues that compound verbs (e.g., go get the book), though uncommon in English, are considered the tightest clause combination as they have two verbal elements placed adjacently in a verb phrase, one of which lacks full finiteness.
In summary, the linguistic research of clause hierarchy provides solid theoretical support for our HCA framework.

Hierarchical Clause Annotation
In this section, we elaborate on the manual annotation criteria for the proposed HCA task, the representation of an HCA tree, and the process of building the first HCA corpus.

Annotation Framework
To present a framework for annotation of the clause hierarchy of complex sentences, we referenced and modified Payne's version [14] of clause hierarchy due to his pellucid and comprehensive definitions of clause combination cases.As demonstrated in Figure 2, we did not consider compound verbs as a clause combination, as these cases are uncommon and produce one-verb clauses after annotation.

[Figure 2. Clause hierarchy used in the HCA framework, ordered by degree of grammatical integration (from tight to loose).]
With the above version of clause hierarchy, we synthesized the HCA framework and built a dataset under its guidance. The annotation work consisted of a preprocessing stage, in which silver annotations were transformed from existing schemas (constituency parsing and syntactic dependency parsing), and a manual proofreading stage, in which gold annotations were produced in a dedicated browser-based annotator.
The major concepts in the HCA framework are listed below.

Sentence and Clause
Sentences, typically starting with a capitalized word and ending with a full stop, are principally units of written grammar and the annotation inputs in HCA. A sentence must consist of at least one clause.
Clauses, considered core units of grammar, center around a verb phrase that largely determines what else must or may occur [26]. Clauses can be categorized by the inner verb type:
• Finite: clauses that contain tensed verbs;
• Non-finite: clauses that only contain non-tensed verbs such as ing-participles, ed-participles, and to-infinitives.
In the HCA framework, finite clauses should be annotated, while non-finite clauses that are separated by a comma are also segmented out.

Clause Combination
The main ways in which clauses combine to form sentences are by joining clauses of equal syntactic status (coordination) and by joining clauses in a subordinate relation (subordination):
(1) Coordination and coordinator: Coordination is an interrelation between clauses that share the same syntactic status and are typically connected by a coordinator such as and, or, but, etc. In addition, coordinators can be correlative structures (e.g., either...or... and not only...but also...) or can simply be replaced by a comma.
(2) Subordination, subordinator, and antecedent: Subordination holds between a subordinate clause and a matrix clause that is superordinate to it. Subordination can be categorized as follows:
• Nominative: functions as a clausal argument or noun phrase in the matrix clause and can be subdivided into Subjective, Objective, Predicative, and Appositive.
• Relative: defines or describes a preceding noun head in the matrix clause.
• Adverbial: functions as a Condition, Concession, Reason, and such for the matrix clause.
Subordinators are the words that introduce a subordinate clause and indicate a semantic relation between the subordinate clause and its matrix clause, including subordinate conjunctions, relative pronouns, and relative adverbs. Simple subordinators contain a single word, e.g., that, wh-words, if, etc., while complex ones consist of more than one word, e.g., as if, so that, even though, etc. Antecedents are the nouns or pronouns modified by relative clauses and the nouns explained by appositive clauses, which they precede.
To better explain these HCA definitions, we demonstrate some example sentences, which are segmented into multiple clauses in Table 4.

HCA Representation
As illustrated in Figure 3, we modeled the two basic hierarchical schemas with the concepts defined above, characterizing inter-clause relations with the same nucleus-satellite pattern as in RST. To be specific, coordination is a multinuclear relation that involves two or more clauses (denoted as nucleus nodes $C_i$) dominated by the coordination node co, while subordination is a mononuclear relation (denoted as a directed edge sub) pointing from the matrix clause $C_1$ (nucleus) to its subordinate clause $C_2$ (satellite). As a sentence may consist of more clauses, its HCA representation can be a tree structure, where each node is a clause or an inter-clause coordination, and each directed edge is an inter-clause subordination. A three-layer HCA tree of a complex sentence involving five clauses and four interrelations is demonstrated in Figure 1a.
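The tree representation described above can be sketched as a small data structure. This is only an illustrative sketch, not the corpus format: the class names `Clause` and `Coordination`, the helper functions, the attachment layout of the five clauses, and the relation labels on the edges are all assumptions chosen to reproduce a three-layer tree with five clauses and four interrelations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Clause:
    """Clause node; `subs` holds (relation, satellite) subordination edges."""
    name: str
    subs: List[Tuple[str, "Node"]] = field(default_factory=list)


@dataclass
class Coordination:
    """Coordination node (e.g. "But") dominating two or more nucleus nodes."""
    label: str
    members: List["Node"]


Node = Union[Clause, Coordination]


def count_clauses(node: Node) -> int:
    """Number of clause nodes in the (sub)tree."""
    if isinstance(node, Coordination):
        return sum(count_clauses(m) for m in node.members)
    return 1 + sum(count_clauses(child) for _, child in node.subs)


def count_relations(node: Node) -> int:
    """Each coordination node and each subordination edge is one relation."""
    if isinstance(node, Coordination):
        return 1 + sum(count_relations(m) for m in node.members)
    return sum(1 + count_relations(child) for _, child in node.subs)


def depth(node: Node) -> int:
    """Number of layers in the (sub)tree."""
    if isinstance(node, Coordination):
        return 1 + max(depth(m) for m in node.members)
    return 1 + max((depth(child) for _, child in node.subs), default=0)


# Hypothetical layout: "But" coordinates C2 and C4; C1 and C3 depend on C2,
# and C5 depends on C4. The edge labels are assumed for illustration.
c2 = Clause("C2", [("Condition", Clause("C1")), ("Relative", Clause("C3"))])
c4 = Clause("C4", [("Result", Clause("C5"))])
tree = Coordination("But", [c2, c4])

print(count_clauses(tree), count_relations(tree), depth(tree))  # 5 4 3
```

Keeping coordination as a node and subordination as an edge mirrors the distinction between multinuclear and mononuclear relations made in the text.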

HCA Corpus
With the annotation framework discussed above, we aimed to build an HCA corpus for further research on the possibilities of applying clausal structure features to semantic parsing tasks.We chose the AMR 2.0 dataset as our corpus base, whose 39,260 sentences were collected from the DARPA BOLT and DEFT programs, various newswire data, and weblog data.
The annotation work was conducted in two phases.First, two existing syntactic features, i.e., constituent and syntactic dependency parse trees, were employed to produce silver HCA annotations with transformation rules.Second, human annotators with prior English grammar research experience and extensive hands-on annotation training reviewed and modified silver annotations in a browser-based annotation tool.

Silver Data from Existing Schemas
Previous researchers [27-30] utilized constituency and syntactic dependency parse trees to extract clauses from sentences with manual rules. Following these works, we employed Stanza [31] as our constituency parser and syntactic dependency parser to obtain silver HCA data:
• Constituency parse tree: The constituency parse tree (CPT) represents the syntactic structure of a sentence as a tree, where the nodes are sub-phrases that belong to a specific category in the grammar and the edges are unlabeled. The transformation from the CPT to the silver HCA data consists of three phases:
1. Traverse the non-leaf nodes in the CPT and find the clause-type nodes: S, SBAR, SBARQ, SINV, and SQ.
2. Identify the tokens dominated by a clause-type node as a clause.
3. When a clause-type node dominates another one, an inter-clause relation between them is determined without an exact relation type.
As demonstrated in Figure 4, the first two clauses of the sentence exemplified in Section 1 are identified through their constituency parse tree. The SBAR node and its child node S are combined as a single clause, as no VP is dominated by the other child constituent IN of SBAR. Moreover, the S node on the top dominates the SBAR node, indicating subordination between the two clauses in dashed boxes.
• Syntactic dependency parse tree: The syntactic dependency parse tree (SDPT) consists of a set of directed syntactic relations between the words in the sentence, whose root is either a non-copular verb or the subject complement of a copular verb. The transformation from the SDPT to the silver HCA data consists of three phases:
1. Use a mapping of dependency relations to clause constituents: subjects (S) and the governor, i.e., a non-copular verb (V), via relation nsubj and such; objects (O) and complements (C) in V's dependents via relations dobj, iobj, xcomp, ccomp, and such; adverbials (A) in V's dependents via relations advmod, advcl, prep_in, and such.
2. When a verb is detected in the sentence, a corresponding clause, consisting of the verb and its dependent constituents, can be identified (note that, in a clause with a copular verb, the verb and the other constituents are dependents of the complement).
3. If a clause governs another clause via a dependency relation, the interrelation between them can be determined by the relation label.
As demonstrated in Figure 5, the first two clauses of the sentence exemplified in Section 1 are identified through their syntactic dependency tree. Moreover, the inter-clause relation can be inferred as adverbial: conditional from the dependency relation advcl and the subordinator "If".
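The CPT-based transformation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the rule set used to build the corpus: the bracketed parse is hand-written for a hypothetical sentence rather than produced by Stanza, the helper names `parse_brackets` and `extract_clauses` are invented here, the inverted-clause tag is spelled SINV as in the Penn Treebank, and the SBAR/S merging rule is simplified to "an S directly under an SBAR joins the SBAR clause".

```python
import re
from typing import List, Optional, Tuple

CLAUSE_TAGS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}


def parse_brackets(text: str):
    """Parse a Penn-style bracketed tree into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf token
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def extract_clauses(tree) -> Tuple[List[List[str]], List[Tuple[int, int]]]:
    """Return clauses (token lists) and unlabeled (parent, child) relations."""
    clauses: List[List[str]] = []
    edges: List[Tuple[int, int]] = []

    def visit(node, owner: Optional[int], parent_label: str) -> None:
        label, children = node
        merged = label == "S" and parent_label == "SBAR"  # simplified rule
        if label in CLAUSE_TAGS and not merged:
            clauses.append([])
            new = len(clauses) - 1
            if owner is not None:
                edges.append((owner, new))  # dominance => unlabeled relation
            owner = new
        for child in children:
            if isinstance(child, str):
                clauses[owner].append(child)
            else:
                visit(child, owner, label)

    visit(tree, None, "")
    return clauses, edges


# Hand-written parse of a hypothetical sentence (root is assumed to be S).
cpt = ("(S (SBAR (IN If) (S (NP (PRP it)) (VP (VBZ rains)))) (, ,) "
       "(NP (PRP we)) (VP (VBP stay) (NP (NN home))))")
clauses, edges = extract_clauses(parse_brackets(cpt))
print(clauses)  # [[',', 'we', 'stay', 'home'], ['If', 'it', 'rains']]
print(edges)    # [(0, 1)]
```

As in phase 3, the edge carries no relation type; assigning a label such as Condition would need the dependency relation or a manual decision.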

Gold Data from Manual Annotator
As discussed above, the syntactic structures of the CPT and SDPT can be transformed into clauses and inter-clause relations. However, these silver annotations cannot by themselves fulfill the need to build an HCA corpus for the following reasons:
• Specific inter-clause relations cannot be obtained from the two syntactic structures: the CPT can only indicate the existence of a relation without a label, and the dependency relations in the SDPT have multiple mappings (e.g., ccomp to Predicative or Relative) or no exact mapping (e.g., advcl to no specific adverbial subtype such as Conditional).
• The pre-set transformation rules identify extra clauses that fall outside the HCA definitions. For example, the extracted non-finite clauses (e.g., to-infinitives) embedded in their matrix clauses are too short and lead to hierarchical redundancies in the HCA tree.
• The performance of the two syntactic parsers degrades on long and complex sentences, which are the core concern of our HCA corpus.
Therefore, we recruited a group of human annotators with prior English grammar research experience to proofread these silver HCAs in ClausAnn 1.0, a browser-based tool created for the annotation work. ClausAnn, a Java Web application, provides convenient operations and efficient keyboard shortcuts for annotators, and we open-sourced it in our GitHub repository (https://github.com/MetroVancloud/ClausAnn, accessed on 15 May 2023). A typical annotation trial on ClausAnn consists of the following steps:
1. Review annotations from CPT, SDPT, or other annotators by switching the name tags in Figure 6a.
2. Choose an existing annotation to proofread or just start from the original sentence.
3. Segment a text span into two by double-clicking the blank space at a split point, and select the relation between them in Figure 6b.

Quality Assurance
Two steps were taken jointly to ensure the quality of the final HCA corpus, i.e., multi-round annotation and consistency measurement.
The total annotation work covered 39,260 sentences. Three rounds of annotation were arranged, covering 5%, 5%, and 90% of the total sentences, and were conducted with a progressive and negotiable strategy. Before the first round, every annotator thoroughly understood the HCA framework and could use the tool ClausAnn proficiently after adequate hands-on training. During the first two rounds, the lead annotator, who majors in English grammar, organized discussions on complex or abnormal cases with the other annotators.
For consistency measurement, we tracked the inter-annotator agreement (IAA) after each round of the annotation work. As discussed in Section 2.1, both the HCA and RST parsing tasks aim to extract a clause/EDU hierarchy from texts. Thus, the evaluation metrics of the discourse segmentation [17] and discourse parsing [18] subtasks in RST parsing were adopted as the IAA metrics for evaluating the annotation quality of the HCA corpus:
• P/R/$F_1$ on clauses: precision, recall, and $F_1$-score on the segmented clauses, where a positive match means that the segmented clauses from the two annotators have the same start and end boundaries.
• RST-Parseval on inter-clause relations: the Parseval-style metric adopted from the discourse parsing subtask [18].
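The clause-level P/R/$F_1$ metric can be illustrated with a short sketch. It assumes that each clause is represented as a (start, end) pair of token offsets and that one annotator's output is treated as the reference; the function name `clause_prf` and the example spans are hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]


def clause_prf(reference: Iterable[Span], candidate: Iterable[Span]):
    """Precision, recall and F1 over clauses with identical boundaries."""
    ref, cand = set(reference), set(candidate)
    matched = len(ref & cand)  # same start AND same end boundary
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Annotator A segments three clauses; annotator B splits the middle one.
annotator_a = [(0, 3), (4, 8), (9, 12)]
annotator_b = [(0, 3), (4, 6), (7, 8), (9, 12)]
p, r, f1 = clause_prf(annotator_a, annotator_b)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Because a match requires both boundaries to agree, a single disputed split point costs two candidate clauses and one reference clause, which is why exact-boundary agreement is a strict measure.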
In the first two rounds of annotation, 10% of the sentences were double-annotated, and the ratio was 16% in the last round, higher than the 13.8% in the RST-DT corpus [7]. According to the statistics, the IAA measured by the above two metrics grew as the annotation rounds increased, indicating that the two steps of multi-round annotation and consistency measurement played a significant role in ensuring annotation quality.
As shown in Table 5, the final IAA achieved high consistency. Compared with the RST-DT corpus, whose IAA scores on EDUs ranged from 95.1 to 100 and whose IAA scores on rhetorical relations under three metrics (span, nuclearity, and relation) ranged from 77.8, 69.5, and 59.7 to 92.9, 88.2, and 79.2, respectively, our HCA corpus reached better consistency, as the HCA framework has more restricted definitions of the elementary unit (i.e., clauses) and fewer types of interrelation.

Dataset Detail
The resulting HCA-AMR2.0 dataset was based on AMR 2.0, which contains 39,260 sentences: 19,376 (49.4%) sentences were paired with an HCA tree, while the rest were simple sentences with only one clause. The train, dev, and test split followed the original split in AMR 2.0. Detailed statistics are listed in Table 6.
Table 6. Main statistics of the hierarchical clause annotation dataset based on AMR 2.0 (HCA-AMR2.0). * means that some input sequences contain multiple sentences, and the coordination MulSnt is necessary for the inter-sentence relations in these cases. A second marker indicates that Adverbial can be divided into nine sub-types, such as Condition, Concession, and Purpose. Note that "#" represents the number of the subsequent item.

Model
In this paper, we modeled hierarchical clause annotation (HCA) as a two-stage task, i.e., clause segmentation and clause parsing, and provided auto-annotation baselines for each subtask. Clause segmentation segments a complex sentence into several clauses, while clause parsing links the clauses with interrelations into a clause tree.

Clause Segmentation
We modeled clause segmentation as a sequence-labeling task, where the input sentence is a sequence $X = (x_1, \ldots, x_i, \ldots, x_n)$ with $n$ tokens and the output is a label sequence $Y = (y_1, \ldots, y_i, \ldots, y_n)$. Note that $y_i$ is binary, i.e., $y_i = 1$ if $x_i$ is the head word token (i.e., the first token) of a clause, and $y_i = 0$ otherwise. Therefore, we approached the clause segmentation task with a sequence-tagging model, DisCoDisCo [17], used in the discourse segmentation task: embed the input sentence; encode with a single BiLSTM; decode with a linear projection layer; indicate the first token of each clause in the output tag sequence.
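The binary tagging scheme can be made concrete with a small round-trip sketch. It assumes, for simplicity, that clauses are contiguous token spans and that the tag 1 marks the first token of each clause; the function names and the example sentence are hypothetical.

```python
from typing import List, Sequence, Tuple


def starts_to_tags(n_tokens: int, clause_starts: Sequence[int]) -> List[int]:
    """Encode clause-initial token positions as a binary tag sequence Y."""
    starts = set(clause_starts)
    return [1 if i in starts else 0 for i in range(n_tokens)]


def tags_to_spans(tags: Sequence[int]) -> List[Tuple[int, int]]:
    """Decode a tag sequence into half-open (start, end) clause spans."""
    starts = [i for i, tag in enumerate(tags) if tag == 1]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # the first token always opens a clause
    ends = starts[1:] + [len(tags)]
    return list(zip(starts, ends))


tokens = "If it rains , we stay home".split()
tags = starts_to_tags(len(tokens), [0, 4])
print(tags)                 # [1, 0, 0, 0, 1, 0, 0]
print(tags_to_spans(tags))  # [(0, 4), (4, 7)]
```

A flat tag sequence of this kind cannot by itself express a clause that is interrupted by an embedded clause, which is one reason the hierarchy is recovered in a separate parsing stage.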
Additionally, we introduced and embedded various grammatical features, such as lemmas, parts of speech (POS), and syntactic dependencies generated by Stanza [31]. These token-wise feature embeddings were concatenated with the word embeddings, and the word $w_i$ in the input sentence was embedded as:
$$u_i = [u^{word}_i ; u^{feat}_i] \tag{1}$$
where $u^{word}_i$ is concatenated from three kinds of word embeddings and $u^{feat}_i$ is the embedding of the grammatical features.
The embedding $u_{1:n}$ of the input sentence with $n$ tokens was fed through a BiLSTM network, and the output $s_i$ was then calculated by a linear projection layer to predict the segmentation tag for the $i$-th token:
$$s_i = W_{proj} \cdot \mathrm{BiLSTM}(u_{1:n})_i + b_{proj} \tag{2}$$
where $W_{proj}$ and $b_{proj}$ are the weight and bias of the linear projection layer. Given $s_i$, we obtained the predicted tag sequence $\hat{t}$ and optimized the cross-entropy loss $L(\hat{t}, t)$ to train the weights of the PLM, the BiLSTM network, and the linear projection layer, where $t$ is the gold tag sequence.

Clause Parsing
Given the segmented clauses $\{c_1, \ldots, c_i, \ldots, c_m\}$ of the input sentence, we modeled clause parsing as a sequence-to-sequence transduction, where the output is a linearized binary tree consisting of $m$ clauses and $m - 1$ inter-clause relations. As clause parsing and discourse parsing can be modeled in the same way, we employed the discourse parsers in [18], namely a span-based parser with the top-down strategy and a shift-reduce transition parser with the bottom-up strategy, for their simple architectures and open-source code. Overviews of the parsers are shown in Figures 7 and 8. Note that the two parsing strategies share the same word embedding layer to represent text spans.

Text Span Embedding
In the process of clause parsing, the representation of text spans is needed for either "span-to-clause" splitting in the top-down parsing strategy or "clause-to-span" combining in the bottom-up parsing strategy. Therefore, we transformed the input sentence into a subword sequence $\{t_1, t_2, \ldots, t_n\}$ and obtained the embeddings $\{w_1, w_2, \ldots, w_n\}$ using a PLM. The embedding $u_{i:j}$ for a text span consisting of the $i$-th clause to the $j$-th clause is obtained by averaging the vectors of both edge subwords:
$$u_{i:j} = \frac{w_{b(i)} + w_{e(j)}}{2} \tag{3}$$
where $b(i)$ returns the index of the begin subword in the $i$-th clause and $e(j)$ returns that of the end subword in the $j$-th clause.
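The span embedding by edge-subword averaging can be sketched with plain Python lists. The two-dimensional vectors below are toy values standing in for PLM outputs, indices are 0-based here, and the names `span_embedding`, `begin`, and `end` are hypothetical.

```python
from typing import List, Sequence


def span_embedding(
    subword_vecs: Sequence[Sequence[float]],
    begin: Sequence[int],  # begin[i]: index of the first subword of clause i
    end: Sequence[int],    # end[j]: index of the last subword of clause j
    i: int,
    j: int,
) -> List[float]:
    """Average the first subword of clause i and the last subword of clause j."""
    left = subword_vecs[begin[i]]
    right = subword_vecs[end[j]]
    return [(a + b) / 2 for a, b in zip(left, right)]


# Four subwords forming two clauses: subwords 0-1 and subwords 2-3.
vecs = [[0.0, 2.0], [1.0, 1.0], [4.0, 0.0], [2.0, 6.0]]
begin, end = [0, 2], [1, 3]

print(span_embedding(vecs, begin, end, 0, 0))  # [0.5, 1.5]
print(span_embedding(vecs, begin, end, 0, 1))  # [1.0, 4.0]
```

Only the two edge subwords contribute, so the cost of representing a span does not grow with its length; the interior is assumed to be captured by the contextualized embeddings.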

Top-Down Strategy
The top-down parser splits each span into smaller ones recursively until the span becomes a single clause.We introduced biaffine networks [34] for span splitting and a loss penalty.
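For each position $k$ in a span consisting of the $i$-th clause to the $j$-th clause, a scoring function $s_{split}(i, j, k)$ is defined as follows:
$$s_{split}(i, j, k) = h_{i:k}^{\top} W h_{k+1:j} + v_{left}^{\top} h_{i:k} + v_{right}^{\top} h_{k+1:j} \tag{4}$$
where $W$, $v_{left}$, and $v_{right}$ are the weights of the biaffine layer for splitting a text span. Here, $h_{i:k}$ and $h_{k+1:j}$ are defined as follows:
$$h_{i:k} = \mathrm{FFN}_{left}(u_{i:k}) \tag{5}$$
$$h_{k+1:j} = \mathrm{FFN}_{right}(u_{k+1:j}) \tag{6}$$
Then, the span is split at the end of the inner clause that maximizes Equation (4):
$$\hat{k} = \arg\max_{k \in \{i, \ldots, j-1\}} s_{split}(i, j, k) \tag{7}$$
When splitting a span at the end of the $\hat{k}$-th clause, the score of the nuclearity and relation labels for the two spans is defined as follows:
$$s_{label}(i, j, \hat{k}, \ell) = h_{i:\hat{k}}^{\top} W^{\ell} h_{\hat{k}+1:j} + (v^{\ell}_{left})^{\top} h_{i:\hat{k}} + (v^{\ell}_{right})^{\top} h_{\hat{k}+1:j} \tag{8}$$
where $W^{\ell}$, $v^{\ell}_{left}$, and $v^{\ell}_{right}$ are the weights of the biaffine layer for predicting an inter-clause relation. Then, the label that maximizes Equation (8) is assigned to the spans:
$$\hat{\ell} = \arg\max_{\ell \in L} s_{label}(i, j, \hat{k}, \ell) \tag{9}$$
where $L$ denotes either the three nuclearity labels (subordination in either direction and the bidirectional $\longleftrightarrow$), for predicting the nuclearity, or the set of inter-clause relation labels, for predicting the exact relation. Note that the parameters in the biaffine layers and the FFNs for the nuclearity and relation labeling are learned separately.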
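The recursive control flow of the top-down strategy can be sketched independently of the neural scorers. In the sketch below, the biaffine scoring functions are replaced by toy lookup tables (an assumption made purely for illustration), clause indices are 0-based, and the label names are hypothetical.

```python
from typing import Callable, Sequence, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def top_down_parse(
    i: int,
    j: int,
    split_score: Callable[[int, int, int], float],
    label_score: Callable[[int, int, int, str], float],
    labels: Sequence[str],
) -> Tree:
    """Recursively split the span of clauses i..j until single clauses remain."""
    if i == j:
        return i  # a single clause is a leaf
    # Choose the split point k (the span is cut after clause k).
    k = max(range(i, j), key=lambda c: split_score(i, j, c))
    # Choose the label for the two resulting sub-spans.
    label = max(labels, key=lambda lab: label_score(i, j, k, lab))
    left = top_down_parse(i, k, split_score, label_score, labels)
    right = top_down_parse(k + 1, j, split_score, label_score, labels)
    return (label, left, right)


# Toy scorers standing in for the trained biaffine layers.
boundary_strength = {0: 0.2, 1: 0.9, 2: 0.5}
preferred_label = {0: "Subordination", 1: "Coordination", 2: "Subordination"}


def toy_split_score(i: int, j: int, k: int) -> float:
    return boundary_strength[k]


def toy_label_score(i: int, j: int, k: int, label: str) -> float:
    return 1.0 if label == preferred_label[k] else 0.0


tree = top_down_parse(0, 3, toy_split_score, toy_label_score,
                      ["Coordination", "Subordination"])
print(tree)
# ('Coordination', ('Subordination', 0, 1), ('Subordination', 2, 3))
```

With $m$ clauses, the recursion always makes $m - 1$ splits, which matches the binary tree with $m - 1$ inter-clause relations described for clause parsing.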

Bottom-Up Strategy
Formally, in a shift-reduce model with a bottom-up strategy, the parsing state is denoted as a tuple $(S, Q)$, where $S$ is a stack that stores processed clauses and $Q$ is a queue that contains incoming clauses. Each element in $S$ can be an unreduced clause $e_i$ or a combined composite item $e_{i:j}$. At each step, the parser chooses one of the following actions with an FFN classifier and updates the state $(S, Q)$:
• SHIFT: pop the first clause off $Q$ and push it onto $S$.
• REDUCE: pop two elements from $S$ and push onto $S$ a single composite item that has the popped subtrees as its children.
We employed three FFN classifiers, $\mathrm{FFN}_{act}$, $\mathrm{FFN}_{nuc}$, and $\mathrm{FFN}_{rel}$, where $\mathrm{FFN}_{act}$ predicts an action, and the remaining two decide the nuclearity and the label of an inter-clause relation after a REDUCE action. Specifically, the output dimension of $\mathrm{FFN}_{act}$ is 2 (SHIFT or REDUCE), that of $\mathrm{FFN}_{nuc}$ is 3 (the three nuclearity labels), and that of $\mathrm{FFN}_{rel}$ is the number of inter-clause relations defined in the HCA framework. The three classifier outputs $s_{*}$ are defined as
$$s_{*} = \mathrm{FFN}_{*}(\mathrm{Concat}(u_{s_0}, u_{s_1}, u_{q_0})), \quad * \in \{act, nuc, rel\} \tag{10}$$
where the function Concat concatenates three state vectors: $u_{s_0}$ is the representation of the top element stored in $S$, $u_{s_1}$ is that of the second element of $S$, and $u_{q_0}$ is that of the first clause in $Q$. The weights of the PLM and each FFN are trained by optimizing the cross-entropy loss of $s_{act}$, $s_{nuc}$, and $s_{rel}$.
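The state transitions of the bottom-up strategy can be sketched as a small loop. The FFN classifiers are replaced here by a scripted toy policy and a toy labeling rule (both assumptions for illustration only), clause indices are 0-based, and the label names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]


def shift_reduce_parse(
    n_clauses: int,
    choose_action: Callable[[List[Tree], Deque[int]], str],
    choose_label: Callable[[Tree, Tree], str],
) -> Tree:
    """Build a binary clause tree bottom-up from n_clauses >= 1 clauses."""
    stack: List[Tree] = []
    queue: Deque[int] = deque(range(n_clauses))
    while queue or len(stack) > 1:
        if len(stack) < 2:
            action = "SHIFT"    # REDUCE needs two stack elements
        elif not queue:
            action = "REDUCE"   # nothing left to shift
        else:
            action = choose_action(stack, queue)
        if action == "SHIFT":
            stack.append(queue.popleft())
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append((choose_label(left, right), left, right))
    return stack[0]


# Toy policy: a fixed script for the two steps where both actions are legal.
script = iter(["REDUCE", "SHIFT"])


def toy_action(stack: List[Tree], queue: Deque[int]) -> str:
    return next(script)


def toy_label(left: Tree, right: Tree) -> str:
    both_composite = isinstance(left, tuple) and isinstance(right, tuple)
    return "Coordination" if both_composite else "Subordination"


tree = shift_reduce_parse(4, toy_action, toy_label)
print(tree)
# ('Coordination', ('Subordination', 0, 1), ('Subordination', 2, 3))
```

The classifier is consulted only when both actions are legal; with four clauses and this script that happens twice, and the loop performs four SHIFT and three REDUCE actions in total.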

Experiments
In this section, we elaborate on the experimental details of the proposed baseline models for the two HCA subtasks, clause segmentation and clause parsing, on the novel HCA-AMR2.0 corpus.

Dataset
In addition to the HCA-AMR2.0 corpus, we reference three discourse analysis corpora, GUM [35], STAC [36], and RST-DT [7], to further verify the effectiveness of the proposed models and expound the distinctive advantages of HCA-AMR2.0 over these discourse analysis corpora. For the clause segmentation subtask, we selected all three discourse analysis datasets evaluated on the discourse segmentation task as a comparison, and the data files are all in CoNLL format. For the clause parsing subtask, we only selected the RST-DT dataset evaluated on the discourse parsing task as a comparison, as previous discourse parsing experiments on the other two datasets used accuracy as the evaluation metric rather than Parseval. The interrelation data files in HCA and RST-DT were preprocessed by Heilman and Sagae's system [37]. The important dataset statistics related to the experiments are listed in Table 7.
Table 7. Main statistics of the HCA-AMR2.0, GUM, STAC, and RST-DT datasets. Note that "#" represents the number of the subsequent item, "Unit" or "U" represents the clause or elementary discourse unit (EDU), "S" represents sentences, and "Rels." represents inter-clause/EDU relations. Thus, "# Units/Sentences" means the number of units or sentences, "# Avg. U/S" means the average number of units per sentence, and "# Avg. Rels./U" means the average number of inter-clause/EDU relations per unit.

Experimental Environments
The information on the main hardware and software used in our experimental environments is listed in Table 8.

Hyper-Parameters
For the hyper-parameters of the models for the clause segmentation and clause parsing subtasks, we list their layer, name, and value in Table 9. Note that neither model was trained until reaching the maximum number of epochs: the clause segmentation model finished training in about thirteen epochs (five hours), and the clause parsing model finished in about seven epochs (ten hours) in the experimental environments introduced in Section 5.2. For the batch size, we tested several candidates and found no significant performance gaps. For the optimizer, we tested AdamW and SGD in the hyper-parameter-tuning experiments of both the clause segmentation and parsing subtasks. The results showed that AdamW outperformed SGD and converged faster.

Evaluation Metrics
As described in Section 3.3.3, we introduced two metrics for IAA to evaluate the annotation consistency between two annotators. Accordingly, we also used P/R/F1 on clauses for the clause segmentation subtask and RST-Parseval on inter-clause relations for the clause parsing subtask to evaluate the consistency between gold and predicted annotations.
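As a rough illustration of how such scores can be computed, the following Python sketch treats both metrics as set matching between gold and predicted items. It is a simplified sketch under stated assumptions, not the evaluation script used in the experiments: the function name `precision_recall_f1`, the exact-match criterion, the toy spans, and the abbreviated nuclearity labels are all hypothetical. For segmentation, an item is a clause span; for a Full-style parsing score, an item is a span together with its nuclearity and relation label.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(gold: Iterable[Hashable],
                        pred: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two collections of items."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    p = correct / len(pred_set) if pred_set else 0.0
    r = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Clause segmentation: items are (start_token, end_token) spans (toy values).
gold_clauses = [(0, 4), (5, 8), (9, 18)]
pred_clauses = [(0, 4), (5, 10), (11, 18)]
seg_scores = precision_recall_f1(gold_clauses, pred_clauses)

# Clause parsing, Full-style: items are
# (first_clause, last_clause, nuclearity, relation) tuples (toy values).
gold_spans = [(0, 1, "SN", "conditional"), (0, 2, "NS", "relative"),
              (3, 4, "NS", "resultative"), (0, 4, "NN", "but")]
pred_spans = [(0, 1, "SN", "conditional"), (0, 2, "NS", "relative"),
              (3, 4, "NS", "adverbial"), (0, 4, "NN", "but")]
full_scores = precision_recall_f1(gold_spans, pred_spans)
```

Dropping the last two fields of each tuple before scoring would give a Span-style score, which ignores nuclearity and relation labels.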

Baseline Models
We aimed to reuse models proposed in previous work on the discourse segmentation and discourse parsing tasks and experimented with them on our HCA-AMR2.0 corpus to obtain effective baseline models for the novel clause segmentation and parsing subtasks.

Experimental Results
The experiments of the clause segmentation and parsing subtasks were conducted separately, and the experimental results were compared with those of previous works on discourse segmentation and parsing tasks, respectively.

Results of Clause Segmentation
We adapted the discourse parser DisCoDisCo [17] to the clause segmentation task and conducted an ablation study on different features, e.g., lemma, syntactic dependency, parts-of-speech, and static word embedding fastText, to explore which contributed more to the performances.The experimental results of the clause segmentation task are reported in Table 10, as well as the performances of previous works on the discourse segmentation task for comparison.From the results, we made the following observations:

• Compared with the IAA scores of 98.4-100 on clause segmentation mentioned in Section 3.3.3, the adapted DisCoDisCo model achieved a satisfactory performance, i.e., an F1-score of 91.3.

• In the ablation study of different embedded features, the static word embedding fastText, which yielded a 4.9-point F1 improvement, contributed the most among all features, although the other features also had positive impacts on the performance.

• In the discourse segmentation task, the DisCoDisCo model outperformed the GumDrop model on the GUM and RST-DT datasets. Although GumDrop performed better than DisCoDisCo on the STAC dataset, its performance declined sharply, by 14.7 F1 points, when the extra gold features were removed. Thus, we chose to apply the DisCoDisCo model to the clause segmentation subtask.

• With the same model, DisCoDisCo, the performance on clause segmentation was about 3-5 F1 points lower than on discourse segmentation, indicating that the performance gaps can be attributed to the corpora. From the statistics in Table 7, an average of 3.1 clauses constitute a sentence in HCA-AMR2.0, more than in GUM (2.3 EDUs), STAC (1.1 EDUs), and RST-DT (2.6 EDUs). Besides, only HCA-AMR2.0 contains a certain amount of weblog data, which are less formal in English grammar and may pose more obstacles for the DisCoDisCo model.

We employed the bottom-up and top-down discourse parsers in [18] as our clause-parsing models and experimented with the base versions of different pretrained language models (PLMs), i.e., BERT [39], RoBERTa [40], SpanBERT [41], XLNet [42], and DeBERTa [43], to obtain better performance. Table 11 shows the main results of both the bottom-up and top-down models on clause parsing, as well as those on discourse parsing for comparison. From the results, we made the following observations:

• Better performance was obtained in clause parsing than in discourse parsing by both the top-down and bottom-up parsers, regardless of the PLM.

• The best clause-parsing performance was obtained by the bottom-up parser with the pretrained DeBERTa, which reached an F1-score of 97.0 on Parseval-Span and 87.7 on Parseval-Full.

• All the best performances in either clause parsing or discourse parsing were obtained by parsers with pretrained XLNet and DeBERTa, indicating that these two PLMs are more suitable for relation classification tasks than the other PLMs.

• With the same model and the same pretrained language model, the Parseval-Full scores on clause parsing were about 6-9 points higher than those on discourse parsing, indicating that the performance gaps can be attributed to the differences between the corpora. RST-DT contains 18 classes of relations partitioned from 78 types of blurry rhetorical relations, while HCA has 18 types of distinguishable semantic relations, half of which were subdivided from adverbial.

To obtain a clause parser with better performance, we conducted experimental trials with the large versions of the PLMs, and the results are shown in Table 12. The following can be observed from the results:

• All the parsers with the corresponding large PLMs performed better than with the base PLMs.

• The bottom-up parser with DeBERTa-large achieved a better performance, with a 0.8-point F1 improvement on Parseval-Full over the parser with DeBERTa-base.

As discussed in Section 1, the HCA tree, the AMR graph, and the SDG of a complex sentence share the same hierarchy, indicating the potential of utilizing the structural information of the HCA tree to improve semantic parsing. Thus, we provide two case studies, where simple transformation rules derived from the HCA tree can be applied to these two semantic parsing tasks.

Case Study for AMR Parsing
For AMR parsing, we employed the state-of-the-art AMR parser proposed in [1] to predict the AMR graph of the exemplified sentence in Section 1. As shown in Figure 9, two dotted red edges were missed by the parser when compared with the gold AMR, while two solid red edges were mistakenly predicted. In the HCA tree given in Figure 1a, the clause "I get very anxious", mapping to the subgraph G_2, and the clause "but often the anxiety is so much", mapping to the subgraph G_4, are coordinate and contrastive. Therefore, we provide a transformation rule:
1. Transform the inter-clause relation but to an AMR node contrast-01 and two AMR edges directed to the root nodes of G_2 and G_4.
Meanwhile, the clause "that I can not wait that long", mapping to the subgraph G_5, is a resultative adverbial clause subordinated to the clause "but often the anxiety is so much", mapping to the subgraph G_4. Note that the verb of the matrix clause is the copula "is", and the subordinate clause modifies the complement "much" in the matrix clause. Therefore, a new transformation rule can be derived:
2. Transform the inter-clause relation resultative to an AMR node cause-01 and two AMR edges directed to the root node of G_5 and the node much in G_4.
With these two transformation rules, we can delete the solid red edges, which were mistakenly predicted, and add the dotted red edges, which were missed.
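The two AMR rules (mapping but to a contrast-01 node and resultative to a cause-01 node) can be illustrated with a small Python sketch over a graph stored as triples. This is a hedged illustration, not the procedure used in the case study: the function `add_relation_node` is hypothetical, the node names `root(G2)`, `root(G4)`, and `root(G5)` are placeholders for the actual root concepts, and the role labels (:ARG0, :ARG1, :ARG2) are assumptions, since the text specifies only the new nodes and their targets.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (source node, role, target node)


def add_relation_node(graph: Set[Triple], concept: str,
                      arguments: Dict[str, str]) -> Set[Triple]:
    """Return a copy of `graph` with `concept` linked to each target node."""
    new_edges = {(concept, role, target) for role, target in arguments.items()}
    return set(graph) | new_edges


predicted: Set[Triple] = set()  # edges output by the parser would go here

# Rule 1: the coordinate relation "but" between clauses C2 and C4.
# Role labels are assumed for illustration.
fixed = add_relation_node(predicted, "contrast-01",
                          {":ARG1": "root(G2)", ":ARG2": "root(G4)"})

# Rule 2: the resultative relation between clauses C4 and C5.
# The rule targets the node "much" in G4 and the root of G5.
fixed = add_relation_node(fixed, "cause-01",
                          {":ARG0": "much", ":ARG1": "root(G5)"})
```

Deleting the mistakenly predicted edges would be the complementary set difference on the same triple representation.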

Case Study for Semantic Dependency Parsing
For semantic dependency parsing, we employed the state-of-the-art semantic dependency parser proposed in [2] to predict the semantic dependency graph (SDG) of the exemplified sentence in Section 1. As shown in Figure 10, three dotted red edges were missed by the parser when compared with the gold SDG. As discussed in Section 6.1.1, inter-clause relations among the three clauses mapping to subgraphs G_2, G_4, and G_5 can be derived by the following transformation rules:
1. Transform the inter-clause relation but to a dependency edge between the root nodes (i.e., anxious and much) of G_2 and G_4.
2. Transform the inter-clause relation resultative to a dependency edge between the root nodes (i.e., much and can) of G_4 and G_5.
Moreover, the relative subordinate clause "which does sort of go away after 15-30 min", mapping to the subgraph G_3, modifies the complement "anxious" in the matrix clause "I get very anxious", mapping to the subgraph G_2. Thus, a new transformation rule can be derived:
3. Transform the inter-clause relation relative to a dependency edge between the root node (i.e., go) of G_3 and the node anxious of G_2.
With these three transformation rules, we can add the dotted red edges, which were missed by the parser.
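A minimal Python sketch of how the three dependency rules could be applied is given below. It is an illustration under stated assumptions rather than the procedure used in the case study: the names `RULES` and `recover_edges` are hypothetical, each edge is stored as an unordered node pair tagged with the inter-clause relation it was derived from, and the actual edge directions and dependency labels of the target formalism are not modeled because the rules do not state them.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (node_a, node_b, inter-clause relation)

# Node pairs named by the three transformation rules.
RULES: List[Edge] = [
    ("anxious", "much", "but"),      # roots of G2 and G4
    ("much", "can", "resultative"),  # roots of G4 and G5
    ("go", "anxious", "relative"),   # root of G3 and the node anxious in G2
]


def recover_edges(predicted: Set[Edge], rules: List[Edge]) -> Set[Edge]:
    """Add each rule-derived edge unless the same node pair is already linked."""
    linked = {frozenset((a, b)) for a, b, _ in predicted}
    added = {edge for edge in rules if frozenset(edge[:2]) not in linked}
    return predicted | added


predicted: Set[Edge] = set()  # the parser missed all three inter-clause edges
recovered = recover_edges(predicted, RULES)
```

Applying `recover_edges` a second time leaves the graph unchanged, because every rule-derived node pair is then already linked.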
To sum up, we provided two case studies demonstrating the potentialities of HCA in semantic parsing, where transformation rules derived from the HCA tree were applied to modify two state-of-the-art parsers of the AMR parsing and semantic dependency parsing tasks.

Future Work
As demonstrated in the experimental results, we adapted a discourse segmenter and a discourse parser to our HCA subtasks (i.e., clause segmentation and parsing) and achieved satisfactory performance. However, there is still much room for improvement compared with the IAA scores of manual annotation, and we only adapted existing models of discourse segmentation and parsing to serve as baselines for our HCA subtasks. Therefore, we aim to design better models for the HCA task in further research.
In the case studies, we derived some transformation rules from the HCA tree to modify the AMR graph and the SDG, an approach that is explainable to humans but impractical. To address this limitation, we aim to explore better ways of integrating the HCA structure into semantic parsing and more downstream NLP tasks.

Conclusions
In this paper, we proposed a novel framework, hierarchical clause annotation (HCA), to segment complex sentences into clauses and capture inter-clause relations with strict definitions from the linguistic research of clause hierarchy. We aimed to explore the potential of integrating the HCA structural features into semantic parsing with complex sentences, avoiding the deficiencies of previous works such as RST parsing, SPRP, TS, and SSD. Following the HCA framework, we built a large HCA corpus comprising 19,376 English sentences from AMR 2.0. The annotation consisted of silver data transformed from the constituency and syntactic dependency parse trees and gold data annotated by experienced human annotators using a newly created tool, ClausAnn. Moreover, we decomposed HCA into two subtasks, i.e., clause segmentation and clause parsing, and provided effective baseline models for both subtasks to generate more HCA data.

[If I do not check,]C_1 [I get very anxious,]C_2 [which does sort of go away after 15-30 mins,]C_3 [but often the anxiety is so much]C_4 [that I can not wait that long.]C_5
The above sentence is segmented into five clauses C_i, where:
• C_2 and C_4 are coordinate and contrastive;
• C_1 and C_3 are conditional adverbial and relative clauses of C_2, respectively;
• C_5 is a resultative adverbial clause of C_4.

Figure 1. Clauses C_i in (a) correspond to subgraphs G_i in (b,c), respectively. Colored directed edges in (a) are inter-clause relations, mapping the same-colored AMR nodes and edges in (b) and semantic dependencies in (c). Note that reentrant AMR relations in (b) introduced by the pronoun "I" are omitted to save space, as well as semantic dependencies between orphan tokens and the root token "If" in (c).

Figure 3. Two basic hierarchical schemas in HCA, where node C_i, node co, and edge sub represent a clause, coordination, and subordination, respectively.

Figure 4. Extract clauses and inter-clause relations via the constituency parse tree. Two clauses in dashed boxes are identified by underlined clause-type nodes S and the child node SBAR. Note that child constituent nodes of the left VP and the right ADJP are omitted to save space.

Figure 5. Extract clauses and inter-clause relations via the syntactic dependency parse tree. Two clauses in dashed boxes are identified by the underlined verb and the governed constituents. The inter-clause relation adverbial can be determined by the dependency advcl between the two clauses.

Figure 6. Operating steps of an annotation trial in the browser-based tool, ClausAnn. (a) Switch annotator labels and review the corresponding annotation. (b) Segment a text span into two and choose a coordination or subordination between them. (c) When choosing coordination, select the exact coordinate relation, i.e., and, or, or but. (d) When choosing subordination, select the superordinate clause and the exact subordinate relation, i.e., Subjective, Objective, and such.

• HCA-AMR2.0: The first HCA corpus was annotated on sentences in the AMR 2.0 dataset. The source data included discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations, and weblog data used in the DARPA GALE program.
• GUM: The Georgetown University Multilayer corpus was created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University, and its annotation followed the RST-DT segmentation guidelines for English. The text sources consist of 35 documents of news, interviews, instruction books, and travel guides from WikiNews, WikiHow, and WikiVoyage.
• STAC: The Strategic Conversation dataset is a corpus of strategic chat conversations in 45 games annotated with negotiation-related information, dialogue acts, and discourse structures in the segmented discourse representation theory (SDRT) framework.
• RST-DT: The RST Discourse Treebank was developed by researchers at the Information Sciences Institute (University of Southern California), the U.S. Department of Defense, and the Linguistic Data Consortium (LDC). It comprises 385 Wall Street Journal articles from the Penn Treebank annotated with discourse structure in the RST framework.

Figure 9. Abstract meaning representation (AMR) graph predicted by the state-of-the-art AMR parser.Red dotted relation edges, which were missed by the parser, can be recovered by transformation rules derived from the HCA tree.Red solid relation edges, which were mistakenly predicted by the parser, can be deleted by transformation rules derived from the HCA tree.

2. Transform the inter-clause relation resultative to an AMR node cause-01 and two AMR edges directed to the root node of G_5 and the node much in G_4.

Figure 10. Semantic dependency graph (SDG) predicted by the state-of-the-art semantic dependency parser, DynGL-SDP. Dotted red dependency edges, which were missed by the parser, can be recovered by transformation rules derived from the HCA tree.

Table 1. Comparison between our HCA task and the RST-parsing task. Two exemplified sentences are from the RST Discourse Tagging Reference Manual. Units (i.e., clauses or EDUs) are segmented by square brackets and index marks i. Relations between units are represented as arrows directed from a matrix clause or a nucleus EDU to a subordinate clause or a satellite EDU with a specific relation.

Table 2. Comparison between our HCA task and similar tasks that decompose complex sentences into parts. The input sentence "If I do not check, I get very anxious, which does sort of go away after 15-30 min, but the anxiety is so much that I can not wait that long." was selected from the AMR 2.0 dataset and exemplified in Section 1. Underlined words in the Output Example column of each task are modified from the original sentence, while crossed words are deleted from the original sentence.
And: [He should have been here at five]C_1 [and he's not here yet.]C_2
Subordination
(2) Subjective: [What we follow]C_1 [is a foreign security strategic philosophy.]C_2
(3) Objective: [He knows]C_1 [what it takes to start a business here.]C_2
(4) Predicative: [The reason is]C_1 [that you lack confidence.]C_2
(5) Appositive: [I've accepted defeat]C_1 [that this year of my life is a failure.]C_2
(6) Relative: [We've entered into an age]C_1 [when dreams can be achieved.]C_2
(7) Adverbial: [He'd need to do his exam]C_1 [before he went.]C_2

Table 5. Inter-annotator agreement (IAA) of the 16% double-annotated sentences in the HCA corpus by ten annotators marked as 1 to 10. Note that bold and underlined figures indicate the highest and lowest consistencies in the corresponding metrics, respectively.

Table 8. Hardware and software used in our experiments.

Table 9. Final hyper-parameters' configuration of the clause segmentation model. Note that "#" represents the number of the subsequent item.

Table 10. Performances of the adapted DisCoDisCo model on HCA-AMR2.0 for clause segmentation, and performances of DisCoDisCo and GumDrop on three datasets for the contrastive task, discourse segmentation. Note that * and • indicate gold annotated features from the corresponding dataset and silver features annotated by Stanza, respectively. Bold numbers are the best scores on each dataset. All the experiments on the clause segmentation task were conducted for five runs with different seeds, and the experimental results were averaged.

Table 12. Clause-parsing results with large versions of pretrained language models (PLMs), XLNet and DeBERTa (RST-Parseval). † indicates PLMs with a large version. Standard deviations for three runs are shown in parentheses. Bold numbers are the best scores for each model.