1. Introduction
Discourse relation recognition (DRR) is a crucial element of discourse analysis [1], playing a significant role in downstream tasks such as causal reasoning [2], machine translation [3], and information extraction [4]. The Penn Discourse Treebank (PDTB) [5,6] and Rhetorical Structure Theory (RST) [7] are the mainstream discourse-level annotation frameworks for discourse analysis. PDTB targets local relations within a discourse, while RST is designed to analyze the overall discourse structure.
Despite the complexity of PDTB annotation and its high human resource costs, numerous datasets following the PDTB annotation standards [5] have emerged in recent decades. These datasets cover English [5,6], German [8], Portuguese [9], Persian [10], Chinese [11,12], and a multilingual corpus covering six languages (English, Polish, German, Russian, European Portuguese, and Turkish) [13]. However, none of these datasets annotates the linguistic-semantic knowledge internal to the sentences.
Current research on DRR focuses on implicit discourse relation recognition (IDRR), and most methods rely on human-annotated connectives to improve performance [14]. DRR can be regarded as a classification task in which the input consists of two arguments, Arg1 and Arg2, and the goal is to determine the discourse relation between them [15]. Although most existing methods already employ large language models to capture contextual features [16], we contend that they fall short of considering linguistic knowledge and, in particular, neglect the internal characteristics of sentences. This shortcoming is especially evident for sentences that involve adversative discourse relations.
Example 1 shows an adversative sentence (for brevity, we write “adversative sentence” for “adversative complex sentence”). From the context provided, we can infer that she stopped because she realized she could not waste this water when there are people in Watsonville who do not have fresh water to drink. However, the PDTB annotation does not capture this reason. We therefore believe that exploring the linguistic features of sentences in texts with discourse relations will improve discourse comprehension. Specifically, we expect that introducing additional linguistic features will improve the accuracy of discourse relation recognition. Numerous corpora are currently annotated for causal relations [17,18,19], but data for adversative relations are lacking. Adversative relations are equally essential in discourse, which makes a dedicated corpus necessary. The primary challenge is the difficulty of annotation, which demands the involvement of professionals and significant human resources. Leveraging semantic information additionally requires linguistic expertise.
Example 1. Last Sunday, Ms. Johnson [finally got a chance to water her plants]Arg1, but [stopped abruptly.]Arg2 "I realized I couldn’t waste this water when there are people in Watsonville who don’t have fresh water to drink." [WSJ 0766] Sense: Comparison.Concession.Arg2-as-denier
Example 1 is an adversative sentence from the PDTB 3.0 corpus, annotated with the sense Comparison.Concession.Arg2-as-denier. Concession is used when an expected causal relation is cancelled or denied by the situation described in one of the arguments. Arg2-as-denier is used when Arg1 raises an expectation of some consequence, while Arg2 denies it [6].
In this work, we construct a Semantic Augmented Chinese Adversative corpus (SACA) to address the issues stated above. Unlike other corpora, SACA focuses specifically on adversative sentences. Our analysis of the corpus concentrates on two dimensions: the overall sense relation classification and the internal semantic elements. For the overall sense relation classification, we largely follow the PDTB annotation guidelines [5,6]. However, considering the differences between Chinese and English, as well as the semantic characteristics of adversative sentences, we also draw on the classification method of the CDTB [12,20] and divide the overall sense relations of adversative sentences into eight categories (Section 3.1): cause, condition, direct contrast, indirect contrast, concession, expansion, progression, and coordination. To make use of linguistic features, we then discuss how the internal semantic elements are determined and symbolically represented. An adversative sentence is a grammatical structure commonly used to express a situation that contrasts with or opposes the viewpoints, plots, or situations mentioned in the preceding context [21]. Adversative sentences can make the text more vivid and specific, guiding readers to notice aspects or changes that differ from the previous content [22]. Zeng [23] suggests that adversative sentences indicate a contrast between the actual and expected results, and Yuan [24] argues that there is always a reason behind the contrast that leads to the opposition between the actual and expected results. From numerous examples, we also observe that an adversative sentence always has a background for the entire context. We therefore represent the internal semantic elements of an adversative sentence as a quadruple (Section 3.2) consisting of the premise (P), the expected result (Q), the reason (R), and the unexpected result.
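To make the two annotation dimensions concrete, the following Python sketch shows one possible record layout for an annotated instance: an overall sense label plus character spans for the internal semantic elements. It is an illustration under stated assumptions, not the storage format of SACA. The class name, the field names, the helper `span_of`, and the label "U" for the unexpected result are all hypothetical, and the example sentence and its sense label are invented for the illustration rather than taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# The eight overall sense categories named in the text.
SENSES = (
    "cause", "condition", "direct contrast", "indirect contrast",
    "concession", "expansion", "progression", "coordination",
)

# Hypothetical labels: P = premise, Q = expected result, R = reason,
# U = unexpected result ("U" is an assumption made for this sketch).
ELEMENT_LABELS = ("P", "Q", "R", "U")


@dataclass
class AdversativeInstance:
    """One annotated adversative sentence (hypothetical layout)."""
    text: str
    sense: str
    # Maps an element label to its (start, end) character offsets in `text`.
    spans: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.sense not in SENSES:
            raise ValueError(f"unknown sense: {self.sense!r}")
        for label, (start, end) in self.spans.items():
            if label not in ELEMENT_LABELS:
                raise ValueError(f"unknown element label: {label!r}")
            if not 0 <= start < end <= len(self.text):
                raise ValueError(f"bad span for {label}: {(start, end)}")

    def arrangement(self) -> Tuple[str, ...]:
        """Labels of the annotated elements, in order of appearance."""
        return tuple(sorted(self.spans, key=lambda label: self.spans[label][0]))


def span_of(text: str, fragment: str) -> Tuple[int, int]:
    """Offsets of the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return (start, start + len(fragment))


# Invented toy sentence, not a SACA instance.
toy_text = ("It rained all day, so the match should have been cancelled, "
            "but the pitch drained well and the game went ahead.")
toy = AdversativeInstance(
    text=toy_text,
    sense="concession",
    spans={
        "P": span_of(toy_text, "It rained all day"),
        "Q": span_of(toy_text, "the match should have been cancelled"),
        "R": span_of(toy_text, "the pitch drained well"),
        "U": span_of(toy_text, "the game went ahead"),
    },
)
print(toy.arrangement())  # ('P', 'Q', 'R', 'U')
```

Storing spans rather than a fixed four-slot tuple keeps the order of appearance recoverable and lets elements be absent, which matters because real sentences often omit some elements.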
The annotation in SACA largely follows the PDTB annotation guidelines [5,6] and introduces the concept of internal semantic elements shown in Table 1. The corpus includes 9546 text segments, categorized into eight sense types, with each sentence labelled with internal semantic elements. Compared with the existing corpora mentioned in Section 2.1, our corpus has the following features: (1) It highlights eight overall sense relations in adversative sentences. (2) It introduces internal semantic elements into adversative sentences. (3) It incorporates paragraph-level content for added contextual detail.
Based on this corpus, linguistic knowledge can be combined with deep learning techniques, enabling a more systematic exploration of DRR. We propose a task called Chinese Adversative Discourse Relation Recognition (CADRR), which is analogous to the DRR task and aims to identify relations in Chinese adversative sentences. We have developed a model that integrates internal semantic information for the CADRR task, demonstrating the usability and effectiveness of these semantic elements. Through its internal semantic elements, the corpus offers a new perspective on discourse relation recognition. Examining the connection between these elements and their sense types can deepen our understanding of discourse relations and language structure, and may uncover novel language principles.
Our contributions include:
(1) We provide a relatively large-scale, semantically augmented Chinese adversative discourse treebank. It follows PDTB annotations for sense types and annotates the internal semantic elements of adversative complex sentences.
(2) We analyze this corpus, exploring the connection between sense classification and internal semantic element classification.
(3) We introduce a new task called CADRR (Chinese Adversative Discourse Relation Recognition), aimed at predicting discourse relations for Chinese adversative sentences. We then classify the corpus using deep learning models and our proposed method, which utilizes internal semantic elements. The results indicate the effectiveness of our internal semantic features and the applicability of the SACA corpus.
The remainder of this paper is structured as follows: Section 2 reviews discourse relation corpora annotated in the PDTB format, discourse relation recognition models, and research related to adversative sentences. In Section 3, we discuss the details of corpus construction, including categorization, data sources, preprocessing, the annotation process, and consistency checks. In Section 4, we conduct a detailed corpus analysis, focusing especially on the relevance between internal semantic elements and overall categorization. Section 5 introduces the CADRR task in Section 5.1, elaborates our classification model enhanced by internal semantic elements in Section 5.2, and presents and discusses the results in Section 5.4. Finally, in Section 6, we summarize the main conclusions of our study and discuss potential directions for future work.
4. Analysis
4.1. Sense-Level Features
Table 3 shows that concession accounts for the largest proportion of the overall senses at 54.01%, followed by direct contrast and indirect contrast at 13.85% and 10.87%, respectively. Together, these three senses account for 78.73% of the instances, which indicates that concession, direct contrast, and indirect contrast are very common in Chinese adversative sentences.
4.2. Internal Semantic Element Level Features
The semantic annotation of SACA yielded more than 60 distinct arrangements of semantic elements. After excluding arrangements that stem from individual labeling errors and arrangements deemed irrelevant, we obtained 19 internal semantic element arrangements (Table 4). The five most frequent arrangements, with their respective proportions, are as follows: (P, unexpected result) holds the largest share (22.65% of the total), followed by (P, R, unexpected result) (20.02%), (P, Q, R, unexpected result) (11.76%), (P, unexpected result, R) (10.62%), and (R, P, unexpected result) (9.65%).
Semantic elements are often missing in real language corpora. We find that an adversative meaning is typically formed when two or more internal semantic elements appear together, and that the arrangement of these elements tends to follow certain patterns. All four semantic elements appear together in three arrangements: (P, Q, unexpected result, R), (P, Q, R, unexpected result), and (Q, P, R, unexpected result). The positions of P and Q are interchangeable, and the positions of R and the unexpected result are generally interchangeable as well. In addition, P and Q often appear before R and the unexpected result. Q is the most easily omitted element, while P and the unexpected result are the most frequently occurring elements. At the level of individual elements, P and R often appear together, as do P and the unexpected result. By contrast, Q and R co-occur relatively rarely.
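The arrangement shares and co-occurrence observations above can be computed with simple counting. The Python sketch below is a minimal illustration, not the procedure used for SACA: the function names are hypothetical, "U" is an assumed label for the unexpected result, and the five toy arrangements are invented rather than drawn from the corpus.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Tuple

Arrangement = Tuple[str, ...]


def arrangement_shares(arrangements: Iterable[Arrangement]) -> Dict[Arrangement, float]:
    """Share of each ordered arrangement, most frequent first."""
    counts = Counter(arrangements)
    total = sum(counts.values())
    return {arr: n / total for arr, n in counts.most_common()}


def cooccurrence_counts(arrangements: Iterable[Arrangement]) -> Counter:
    """Number of sentences in which each unordered pair of elements co-occurs."""
    pairs: Counter = Counter()
    for arr in arrangements:
        # Sort so that ("R", "P") and ("P", "R") map to the same key.
        for a, b in combinations(sorted(set(arr)), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented toy data; "U" stands for the unexpected result (assumed label).
toy_arrangements = [
    ("P", "U"),
    ("P", "U"),
    ("P", "R", "U"),
    ("P", "Q", "R", "U"),
    ("R", "P", "U"),
]

shares = arrangement_shares(toy_arrangements)
pairs = cooccurrence_counts(toy_arrangements)
print(shares[("P", "U")])  # 0.4
print(pairs[("P", "U")])   # 5
print(pairs[("Q", "R")])   # 1
```

Arrangements are kept as ordered tuples for the share computation, since (P, R, U) and (R, P, U) are distinct patterns, while co-occurrence ignores order.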
4.3. Semantic–Sense Relation Analysis
Having obtained the sense classifications and the internal semantic element annotations, we investigate the potential correlations between them using mutual information (MI). MI quantifies the degree of association or dependence between two random variables [39]. Specifically, it assesses how much knowing the value of one variable reduces uncertainty about the other. The mutual information between variables X and Y is calculated from the probabilities of their joint occurrences compared with those expected under the assumption of independence. A higher mutual information value indicates a stronger relationship between the variables, implying that knowledge about one variable provides more information about the other.
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \qquad (2)$$

Equation (2) represents the calculation of mutual information between two discrete random variables, X and Y. In this formula, $p(x,y)$ denotes the joint probability of X and Y taking values x and y simultaneously. $p(x)$ and $p(y)$ represent the marginal probabilities of X and Y, respectively.
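As an illustration of this calculation, the Python sketch below computes the mutual information of two discrete variables from a table of joint counts, using the standard definition (the sum over all value pairs of the joint probability times the logarithm of the joint probability divided by the product of the marginals). The function name, the logarithm base, and the toy tables are assumptions made for the sketch. How the per-pattern, per-sense values were derived from the corpus counts is not specified here, so the 2 x 2 indicator tables (pattern present or absent versus sense matches or not) are only one plausible setup.

```python
import math
from typing import Sequence


def mutual_information(joint_counts: Sequence[Sequence[float]], base: float = 2.0) -> float:
    """Mutual information I(X;Y) from a table of joint counts.

    Rows index the values of X, columns the values of Y. Cells with a
    zero count contribute nothing, following the convention 0 * log 0 = 0.
    """
    total = sum(sum(row) for row in joint_counts)
    if total <= 0:
        raise ValueError("joint_counts must contain at least one observation")
    row_totals = [sum(row) for row in joint_counts]
    col_totals = [sum(col) for col in zip(*joint_counts)]

    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, n in enumerate(row):
            if n == 0:
                continue
            p_xy = n / total            # joint probability p(x, y)
            p_x = row_totals[i] / total  # marginal probability p(x)
            p_y = col_totals[j] / total  # marginal probability p(y)
            mi += p_xy * math.log(p_xy / (p_x * p_y), base)
    return mi


# Invented toy tables (not SACA counts).
# Rows: pattern present / absent. Columns: sense matches / does not match.
independent = [[10, 10], [10, 10]]  # pattern tells us nothing about the sense
dependent = [[20, 0], [0, 20]]      # pattern determines the sense exactly

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```

With base 2 the result is in bits: the independent table yields 0 bits, and the perfectly dependent table yields 1 bit, the entropy of a balanced binary variable.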
Table 5 presents the mutual information between specific combinations of internal semantic elements and the overall senses. Only internal semantic element patterns that occur at least 100 times were chosen for analysis. In the table, lower MI values suggest weaker correlations, while higher MI values indicate stronger associations.
The pattern (P, Q, R, unexpected result) has relatively high MI values for expansion. The pattern (P, Q, unexpected result, R) shows notably high MI values for cause, progression, and expansion. The pattern (P, unexpected result, R) is closely associated with direct contrast, with an MI value of 5.2049. The association between the pattern (P, Q, unexpected result) and expansion is particularly strong, with an MI value of 10.5881. The pattern (P, unexpected result) is closely associated with both the direct contrast and the indirect contrast sense types. From the perspective of sense classification, condition, progression, and coordination are not highly correlated with specific internal semantic element patterns, whereas the remaining categories show a strong correlation with specific patterns.