1. Introduction
Causal relation extraction (CE), as a crucial task in natural language processing (NLP), plays an indispensable role in various downstream applications, such as knowledge graphs [1], event logic graphs [2], and question-answering systems [3,4]. Extracting causal relations is challenging because it requires sophisticated models capable of capturing rich semantic information and complex linguistic phenomena [5]. Prior works cast CE as a sequence tagging task, and these methods achieve much better performance than traditional methods. Pre-trained language models such as ELMo and BERT have been introduced to enhance the models’ understanding capability [6,7], further improving performance.
However, these methods are all single-task-based and tend to suffer from the problem of noise information [8]. For instance, certain words in a text, such as adjectives and prepositions, are unlikely to be labeled as “Cause” or “Effect” because they do not point to concrete entities. Yet, such noise information complicates the tagging of cause-and-effect entities. To investigate the impact of noise data, we conducted experiments on the SemEval-2010 Task 8 dataset, the Event StoryLine corpus, and the Causal TimeBank, with the results shown in Table 1. We selected four illustrative samples, each labeled by existing models across different datasets. In the first two samples from the SemEval-2010 Task 8 dataset [9], BERT incorrectly labels “blind” as “B-Cause”, despite it being an adjective. A similar issue occurs in the Causal TimeBank [10]. Moreover, in the Event StoryLine [11] sample, the preposition “off” is mistakenly labeled as “B-Cause”. This problem could be mitigated by incorporating parts of speech (POS) and other linguistic information. Therefore, we conclude that conventional single-task learning (STL)-based architectures often make erroneous predictions regarding the positions of cause–effect entities, indiscriminately labeling noisy adjectives and prepositions as “Cause” or “Effect”.
In the field of natural language processing, various tasks can process and analyze text from different perspectives [12,13,14]. At present, the main causal extraction methods are single-task methods, which cannot fully utilize the characteristics of different aspects of text for causal relation extraction. To address the limitations of single-task learning (STL) strategies, models urgently need to learn various aspects of information from texts. Inspired by the progress of multi-task learning (MTL), we introduce a framework that can leverage information across related tasks to enhance model performance [15,16]. Multi-task learning is a promising approach to improve model performance by training models on multiple tasks simultaneously; its inherent advantage lies in leveraging shared knowledge and complementary information among different but related tasks, thereby alleviating the above problems of STL-based methods [17,18]. Accordingly, we intend to employ the power of MTL to enhance CE performance. We choose part-of-speech tagging (POS tagging) [19] and chunk analysis (Chunk) [20] as the co-training tasks. (1) POS tagging assigns grammatical tags to the words in a sentence, and most causes and effects are nouns (“NN” in POS tagging) [21], since cause and effect entities refer to concrete objects. This can help the model exclude other labels while locating causes and effects. (2) Chunk groups the words in a sentence into meaningful chunks, splitting out spans of distinct semantic components, which is also beneficial for narrowing the scope of causes and effects based on the POS results [20].
To implement multi-task sequence tagging, we adopt a sparse sharing strategy in MTL for more efficient parameter sharing. Parameter sharing in MTL updates common parameters during training; typical methods include hard sharing, soft sharing, and hierarchical sharing [22,23,24]. Among them, hard sharing forces all tasks to share the same hidden space, which limits the expressive ability of the individual tasks and makes it difficult to handle loosely related tasks. Soft sharing does not need to consider the relevance of tasks, but it requires training a separate model for each task, so the overall parameter count grows with the number of tasks. Hierarchical sharing allows each task to share only a portion of the model, and task-specific modules can handle heterogeneous tasks, but designing an effective hierarchical structure is often time-consuming and requires expert experience. Compared with these strategies, sparse sharing can achieve equivalent results while requiring fewer parameters [25,26,27].
Based on the above discussion, we propose a multi-task model that jointly learns POS tagging and Chunk for CE, named MPC−CE, which can denoise the interference factors among massive semantic components and thus help the model locate cause–effect entities. However, this raises a new challenge: annotating POS and chunk labels on top of existing CE datasets. Following the annotation rules of POS tagging and Chunk, we label a given sentence with three types of labels, as shown in Table 2. MPC−CE combines the three subnets to optimize each task’s objective. Additionally, we conduct experiments on two distinct datasets, namely the SemEval-2010 Task 8 dataset and the MTL-CE dataset, to showcase the efficacy of our proposed MPC−CE. Through these experiments, we demonstrate the superior performance of MPC−CE compared to various baseline models; the empirical results highlight its effectiveness and its potential for advancing CE performance. In summary, our contributions are four-fold:
We utilize the POS tagging and Chunk tasks to help capture the attributes of each word, which alleviates the data noise problem in causal relation extraction.
We propose MPC−CE, a multi-task learning model for causal relation extraction that unifies CE with POS tagging and Chunk in a single sequence-labeling framework.
We incorporate a sparse sharing mechanism that reduces the scale of the training parameters through iterative pruning, enabling efficient pruning of the shared layers.
We conduct groups of experiments on both open datasets and our self-merged dataset, and the results demonstrate that MPC−CE gains the largest improvement among all baseline models under the cross-domain scenario.
Table 2. A multi-label data sample from the SemEval-2010 Task 8 dataset; each word in the input sentence is annotated with three types of labels: POS, Chunk, and CE.
Sentence | Muscle | fatigue | is | the | number | one | cause | of | arm | muscle | pain | . |
POS | NNP | NN | VBZ | DT | NN | CD | NN | IN | JJ | NN | NN | . |
Chunk | O | B-NP | O | B-NP | I-NP | O | B-NP | O | B-NP | I-NP | B-NP | O |
CE | O | B-Cause | O | O | O | O | O | O | O | O | B-Effect | O |
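As an illustration of how the POS and Chunk layers of such a sample could be produced automatically (we use NLTK for this annotation, as described in Section 4.1), the following sketch applies NLTK’s POS tagger and a simple regular-expression NP chunker to the Table 2 sentence. The chunk grammar is an illustrative assumption, not necessarily the one used to build our datasets.

```python
# Sketch: producing the POS and Chunk layers of Table 2 with NLTK; the NP-chunk
# grammar is an illustrative assumption.
# (May require: nltk.download('averaged_perceptron_tagger'))
import nltk
from nltk.chunk import tree2conlltags

tokens = "Muscle fatigue is the number one cause of arm muscle pain .".split()

pos_tags = nltk.pos_tag(tokens)                       # POS layer, e.g., ('fatigue', 'NN')

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"                   # hypothetical noun-phrase pattern
chunk_tree = nltk.RegexpParser(grammar).parse(pos_tags)
iob_tags = [iob for _, _, iob in tree2conlltags(chunk_tree)]  # Chunk layer (IOB tags)

# The CE layer (B-Cause / B-Effect / O) comes from the dataset's own annotations.
for token, (_, pos), iob in zip(tokens, pos_tags, iob_tags):
    print(token, pos, iob, sep="\t")
```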
3. Model
Our proposed framework MPC−CE consists of three base modules, which are illustrated in the following subsections. It employs a sparse sharing mechanism [27] with two crucial components, a task-specific layer and a task-sharing layer, as depicted in Figure 1. Task-specific layers extract features specific to a particular task, while task-sharing layers extract common features used to make predictions for all tasks. By sharing complementary information across these tasks, MPC−CE learns to denoise irrelevant semantic components among interference factors and hence performs better under complicated CE circumstances; it can also handle single-source noise data and improve the model’s performance.
As cause entities and effect entities typically manifest as nouns and chunks, POS tagging and Chunk are employed as auxiliary tasks for CE. The shared layer of the model consists of BERT [7] and BiLSTM layers. Given the large number of parameters in the pre-trained BERT, we adopt iterative pruning [48] to reduce the model’s size by pruning parameters whose names contain “attention” and “lstm”. The three tasks generate their subnets individually, and the parameters of each task subnet are partially shared to promote positive transfer while mitigating potential over-parameterization. Furthermore, each task is equipped with its own task-specific layer, i.e., a linear layer and a CRF [49] layer, to predict and validate the labels.
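The following is a minimal PyTorch sketch of this skeleton, not the released implementation: a shared BERT + BiLSTM encoder with one task-specific linear + CRF head per task. The class name, label counts, the bert-base-uncased checkpoint, and the use of the pytorch-crf package are assumptions.

```python
# Minimal sketch of the MPC-CE skeleton: shared BERT + BiLSTM layers and a
# task-specific (linear + CRF) head per task. Names and sizes are illustrative.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # from the pytorch-crf package


class MPCCE(nn.Module):
    def __init__(self, num_labels, hidden=256):
        super().__init__()
        # ---- shared layers (subject to iterative pruning / sparse sharing) ----
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        # ---- task-specific layers: a linear projection and a CRF per task ----
        self.heads = nn.ModuleDict({t: nn.Linear(2 * hidden, k) for t, k in num_labels.items()})
        self.crfs = nn.ModuleDict({t: CRF(k, batch_first=True) for t, k in num_labels.items()})

    def forward(self, input_ids, attention_mask, task, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(x)                      # shared contextual features
        emissions = self.heads[task](h)            # task-specific emission scores
        if tags is not None:                       # training: negative log-likelihood
            return -self.crfs[task](emissions, tags, mask=attention_mask.bool())
        return self.crfs[task].decode(emissions)   # inference: best tag sequence


# Label counts are assumptions: Penn Treebank POS tags, NP-chunk IOB tags, five CE tags.
model = MPCCE({"pos": 45, "chunk": 3, "ce": 5})
```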
3.1. Sharing Layer
3.1.1. BERT Layer
The primary function of BERT is to dynamically map each word in the sentence to a contextual vector based on its context, which addresses the polysemy problem of causality sentences in the text. Suppose the input sequence is denoted as $X = (x_1, x_2, \ldots, x_m)$, where $m$ is the number of words in the sentence. Following the BERT model, the output sequence is represented as $E = (e_1, e_2, \ldots, e_m)$. This mapping process takes into account various factors such as polysemy, the syntactic features of sentences, and so on. To reduce the number of parameters in the model, we prune the parameters containing “attention” in this layer.
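A minimal sketch of this mapping using the HuggingFace transformers library is given below; the bert-base-uncased checkpoint is an illustrative assumption.

```python
# Sketch: each (sub)word token is mapped to a context-dependent vector by BERT.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Muscle fatigue is the number one cause of arm muscle pain ."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = bert(**inputs)

# One contextual vector per token: shape (1, sequence_length, 768).
contextual_vectors = outputs.last_hidden_state
```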
3.1.2. BiLSTM Layer
At this layer, we prune the parameters containing “lstm”. For each input sequence of the BiLSTM module, a forward and a backward LSTM encode it into hidden representations, and we concatenate the two directional hidden states to produce a complete sequence, which helps capture the positional information of each word within the sentence. From this, we obtain the output sequence of the forward LSTM hidden states, $\overrightarrow{h} = (\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_m})$, as well as the output sequence of the backward LSTM, $\overleftarrow{h} = (\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_m})$. These two vectors are combined to obtain the complete output sequence $h$ of the BiLSTM hidden states as follows:

$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \in \mathbb{R}^{n}, \quad t = 1, 2, \ldots, m,$

where $t$ denotes the position of the word in the sentence, $m$ denotes the number of words in the sentence, and $n$ denotes the dimension of the vector.
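A minimal sketch of this layer, assuming PyTorch’s nn.LSTM with bidirectional=True, which already concatenates the forward and backward hidden states as in the formula above:

```python
# Sketch of the BiLSTM layer: each h_t has dimension n = 2 * 256
# (256 is the hidden size reported in Section 4.4).
import torch
import torch.nn as nn

bert_dim, hidden = 768, 256
bilstm = nn.LSTM(bert_dim, hidden, batch_first=True, bidirectional=True)

e = torch.randn(1, 12, bert_dim)   # BERT outputs for a 12-token sentence
h, _ = bilstm(e)                   # h: (1, 12, 512) = [forward ; backward] per word
```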
3.2. Special Layer
3.2.1. Linear Layer
Following the BiLSTM layer, a linear layer maps the hidden state vector $h$, comprising $n$ dimensions, to $k$ dimensions. The linear layer extracts sentence features through a projection matrix, producing the label score vector $P_t$ for each word, as illustrated in the following formula:

$P_t = W h_t + b, \quad W \in \mathbb{R}^{k \times n},$

where $k$ is the number of labels, depending on the specific task. For instance, the CE task involves five distinct labels: “B-Cause”, “I-Cause”, “B-Effect”, “I-Effect”, and “O”. Stacking the $P_t$ of all words yields the emission matrix $P$ used by the CRF layer. The linear layer’s objective is to transform the $n$-dimensional hidden state vector into a more compact $k$-dimensional vector, enabling the effective extraction of high-level features. By minimizing information loss during this process, the linear layer represents the sentence in a more informative and meaningful way, thereby improving the performance of all co-training tasks.
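As a minimal sketch of this projection (assuming the BiLSTM hidden size of 256 reported in Section 4.4 and the five CE labels listed above):

```python
# Sketch of the task-specific linear projection: the 2 * 256-dimensional BiLSTM
# state of each word is mapped to k emission scores (k = 5 for the CE label set).
import torch
import torch.nn as nn

k_ce = 5                              # B-Cause, I-Cause, B-Effect, I-Effect, O
linear_ce = nn.Linear(2 * 256, k_ce)

h = torch.randn(1, 12, 2 * 256)       # BiLSTM output for a 12-token sentence
P = linear_ce(h)                      # emission scores P_t per word: shape (1, 12, 5)
```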
3.2.2. CRF
The Conditional Random Field module [49] tags the label of each word in a given sequence according to an overall maximum-probability computation and has been employed in numerous sequence tagging tasks [49]. For an input sentence $X = (x_1, x_2, \ldots, x_n)$ and target tagging sequence $y = (y_1, y_2, \ldots, y_n)$, where $x_i$ is the $i$-th word of $X$, $y_i$ is the tag of $x_i$, and $n$ is the length of both the input and target sequences, we define the score of the whole tagging as follows:

$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},$

where $s(X, y)$ computes the score of a given tagged sequence, and $P$ is the emit matrix generated from the BiLSTM and linear layers, in which $P_{i, y_i}$ denotes the score of assigning label $y_i$ to the $i$-th word. $A$ is the transfer matrix, storing the score of a transfer from one label to another; the higher the score, the more probable the transfer. $k$ is the number of labels, and $y_0$ and $y_{n+1}$ are the start and end tags of the sentence. We normalize the tagging scores over all possible label sequences with the Softmax function to obtain the conditional probability:

$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}},$

where $Y_X$ denotes all possible label sequences. To decrease the cost of computation, we use its logarithm as follows:

$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}.$

Finally, we obtain the optimal predicted tagging sequence through maximum likelihood estimation (MLE). With the CRF’s ability to learn dependencies among labels, the model outputs a more reasonable labeled sequence, avoiding unstable tagging orders.
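The scoring function above can be made concrete with a short sketch; the emission and transition matrices below are random toy values, and the start/end handling follows the $y_0$ / $y_{n+1}$ convention described in the text.

```python
# Sketch of the tagging score s(X, y) with toy emission (P) and transition (A)
# matrices; start and end tags occupy two extra rows/columns of A.
import torch

k, n = 5, 4                          # number of labels, sentence length (toy)
P = torch.randn(n, k)                # emission scores from the linear layer
A = torch.randn(k + 2, k + 2)        # transitions, incl. start (index k) and end (index k + 1)
start, end = k, k + 1

def crf_score(P, A, y):
    score = A[start, y[0]] + A[y[-1], end]        # start -> y_1 and y_n -> end transitions
    for i in range(len(y)):
        score = score + P[i, y[i]]                # emission term P_{i, y_i}
        if i + 1 < len(y):
            score = score + A[y[i], y[i + 1]]     # transition term A_{y_i, y_{i+1}}
    return score

print(crf_score(P, A, [1, 2, 0, 3]))              # score of one candidate tag sequence
```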
3.3. Sparse Sharing in MPC−CE
Inspired by [27], we adopt the sparse sharing strategy. The basic process of multi-task training based on the sparse sharing mechanism is as follows: first, iterative magnitude pruning (IMP) is used to induce sparsity in the network, obtaining a subnet for each task; then, the task subnets are trained in parallel, and during training each task only updates the weights of its corresponding subnet. Closely related tasks tend to extract similar subnets, which can share similar weights, while loosely related or unrelated tasks tend to extract different subnets. Ref. [27] demonstrated that sparse sharing can enable multiple natural language processing tasks to complement each other.
In the process of generating subnetworks for each task, we independently perform iterative pruning on each task to obtain the mask matrix corresponding to that task, and thus its subnetwork; we choose the subnetwork that performs best on the validation set [27,48], which means the best causal relation extraction results. The underlying network of the model, denoted as $f(x; \theta)$, serves as the base network, with the corresponding model parameters represented by $\theta$. The masking matrix, denoted as $M_t \in \{0, 1\}^{|\theta|}$, assumes binary values of 0 or 1, where 0 signifies parameter masking and 1 signifies parameter reservation.
During the training process, we iteratively prune the parameters associated with “lstm” and “attention” for each of the three tasks. This allows us to obtain the mask matrix $M_t$ specific to each task, defined over the “lstm” and “attention” parameter matrices of the base network. The parameters of the subnet corresponding to task $t$ are then obtained through element-wise multiplication of $M_t$ and $\theta$, resulting in the subnet expressed as

$\theta_t = M_t \odot \theta.$

To effectively reduce the parameter scale, we employ iterative magnitude pruning, i.e., we iteratively prune the model parameters associated with “lstm” and “attention” during the training process. Specifically, we perform $n$ iterations of pruning, executing one pruning operation after each training epoch and subsequently generating a subnet for the corresponding task. The pruning rate for each iteration is calculated using the following formula:

$p = 1 - a^{1/n},$

where $p$ represents the pruning rate for each iteration, $n$ is the number of pruning iterations, and $a$ signifies the percentage of parameters retained in the final model.
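As a worked example under the schedule reconstructed above (an assumption consistent with the settings reported in Section 4.4, i.e., 10 pruning iterations and a final reserve rate of 0.2): $p = 1 - 0.2^{1/10} \approx 1 - 0.851 = 0.149$, so roughly 15% of the surviving “lstm” and “attention” parameters are pruned at each iteration.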
The IMP technique enables the pruning of weight values that fall below a predefined threshold. This process effectively reduces the number of parameters and computational requirements, while concurrently improving the efficiency of neural network training without compromising accuracy. By selectively pruning the network parameters, we obtain a compact representation of the network, facilitating the construction of subnets tailored to specific tasks. This, in turn, enables our model to learn in a more efficient and effective manner, while still maintaining a high level of accuracy on the tasks at hand.
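To make the pruning step concrete, the following is a minimal sketch of one iterative-magnitude-pruning round, assuming a PyTorch model whose prunable parameters are identified by name (containing “lstm” or “attention”) and the per-iteration rate derived above; the function name and the variable `mpc_ce_model` are hypothetical.

```python
# Sketch of one IMP round for a single task: zero out the lowest-magnitude
# fraction p of the still-surviving prunable weights and update the binary
# masks M_t in place. The rate p ~= 0.149 follows the reconstruction above.
import torch

def prune_step(model, masks, p=0.149):
    for name, param in model.named_parameters():
        if "lstm" not in name and "attention" not in name:
            continue
        mask = masks.setdefault(name, torch.ones_like(param))
        magnitudes = (param * mask).abs()
        alive = magnitudes[mask.bool()]
        if alive.numel() == 0:
            continue
        threshold = torch.quantile(alive, p)      # cut-off below which weights are pruned
        mask[magnitudes < threshold] = 0.0
        param.data.mul_(mask)                     # subnet parameters: theta_t = M_t * theta
    return masks

# Usage (hypothetical): starting from masks_ce = {}, call
# masks_ce = prune_step(mpc_ce_model, masks_ce) after each training epoch.
```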
4. Experiment
In this section, we describe the datasets used in our experimental evaluations, followed by an account of the baseline models. We then analyze the results obtained under both the single-task and multi-task learning paradigms. We also conducted ablation studies to examine the efficacy of our proposed methodology, assessing the impact of individual components on the overall performance of MPC−CE. Furthermore, we evaluate how CE performance fluctuates under different hyper-parameter settings, and likewise for the co-training tasks. Specifically, we address the following research questions:
- RQ1:
What is the significance of multi-task learning for the efficacy of the different tasks?
- RQ2:
What is the significance of the POS and Chunk tasks for causal relation extraction?
- RQ3:
What influence does pruning have on the efficacy of different tasks?
- RQ4:
How do different hyper-parameters contribute to the efficacy of different tasks?
4.1. Datasets
As aforementioned, dataset domains greatly affect the models’ final performance. The SemEval-2010 Task 8 dataset [9] is the most commonly used single-domain CE dataset; considering the various types of causality, we combine Event StoryLine v1.0 [11] and Causal-TimeBank [10] into a new CE dataset called MTL-CE, so as to imitate a cross-domain causality scenario. In order to meet the dataset requirements of multi-task learning, we employed NLTK to annotate POS and chunk labels on all the CE datasets, since it achieves high accuracy on both POS tagging and Chunk. We evaluated our model on the SemEval-2010 Task 8 and MTL-CE datasets. The training, validation, and test sets were divided according to a 7:1.5:1.5 ratio.
SemEval-2010 Task 8 dataset. The SemEval-2010 Task 8 dataset [9] is used for the multi-way classification of mutually exclusive semantic relations between pairs of nominals. It consists of a total of 10,717 samples, with 8000 for training and 2717 for testing. It contains nine directional relations, such as Cause–Effect and Message–Topic, and one Other relation; the cause and effect entities are annotated, making it a popular dataset for CE. However, the majority of samples are general causal relations from a single source.
MTL-CE dataset. As previously stated, we merged two distinct datasets, namely Causal TimeBank [10] and Event StoryLine, into a newly created one to aggregate more causal data. Causal TimeBank is an annotated dataset of causal relations derived from the original TempEval-3 TimeBank, whose causal relations are temporal causalities. The Event StoryLine v1.0 dataset [11] comprises 566 data samples, most of which are causal relations from news. Therefore, these two datasets come from distinct domains, which makes them quite suitable for constructing a cross-domain dataset. Finally, our self-made MTL-CE dataset consists of 3482 data samples, with an equal number of causal and non-causal samples (a 1:1 ratio). The dataset’s statistics are presented in Table 3.
4.2. Metrics
For the POS tagging task, since part-of-speech tagging aims to tag each individual word and does not involve word-boundary problems, accuracy is used for evaluation. The accuracy metric (ACC) is a simple and intuitive evaluation measure that assesses the proportion of correct predictions generated by a model; it is straightforward to comprehend and provides a quick overview of the model’s performance.
As for Chunk and CE, since the annotated boundaries involve multiple words, precision (P), recall (R), and the F1 score (F1) were adopted as evaluation metrics. Precision represents the fraction of true positive predictions among all positive predictions, while recall denotes the proportion of true positive predictions among all actual positive instances. F1 is the harmonic mean of precision and recall, balancing the two metrics, and is especially valuable when tackling datasets with a skewed distribution of positive and negative examples.
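For concreteness, these metrics can be computed as follows; the sketch uses the seqeval package for entity-level precision/recall/F1 over IOB tags and a plain token-level ratio for POS accuracy, with toy gold and predicted sequences.

```python
# Sketch of the evaluation metrics described above (toy tag sequences).
from seqeval.metrics import precision_score, recall_score, f1_score

gold = [["O", "B-Cause", "O", "O", "O", "O", "O", "O", "O", "O", "B-Effect", "O"]]
pred = [["O", "B-Cause", "O", "O", "O", "O", "O", "O", "O", "B-Effect", "I-Effect", "O"]]
p, r, f1 = precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred)

# POS accuracy: fraction of tokens whose predicted tag matches the gold tag.
gold_pos = ["NNP", "NN", "VBZ", "DT", "NN", "CD", "NN", "IN", "JJ", "NN", "NN", "."]
pred_pos = ["NN",  "NN", "VBZ", "DT", "NN", "CD", "NN", "IN", "JJ", "NN", "NN", "."]
acc = sum(g == q for g, q in zip(gold_pos, pred_pos)) / len(gold_pos)
```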
4.3. Baselines
We investigated many CE models and compared the efficacy of MTL methods in our experiments. In particular, we considered the following baselines:
CNN+BiLSTM: Ref. [50] presents this multi-task learning method that combines CNN and BiLSTM models to perform multiple NLP tasks simultaneously.
BERT+BiLSTM+MTL: Based on CNN+BiLSTM, we substitute the CNN module with BERT to perform multiple tasks.
BERT+BiLSTM+CRF+MTL: We add a CRF layer on top of BERT+BiLSTM+MTL to make the model more complete.
We compare a range of STL-based and MTL-based methods for CE, so we can identify the most effective method for this task and provide insights for future research in this area.
4.4. Implementation
We conducted several experiments on the SemEval-2010 Task 8 and MTL-CE datasets with the following hyper-parameter settings: the number of pruning iterations was set to 10, the final parameter reserve rate was 0.2, the learning rate was 1 × 10−5 on SemEval-2010 Task 8 and 5 × 10−5 on MTL-CE, and Adam was chosen as the optimizer. For the BiLSTM, we set the hidden size to 256 and the dropout rate to 0.5 to alleviate overfitting. The batch size was 32, the number of training epochs was 50, and the model was evaluated on the validation set after every epoch.
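For reference, the settings above can be summarized in a single configuration sketch (the key names are illustrative, not taken from the released code):

```python
# The hyper-parameters reported above, collected in one place.
config = {
    "pruning_iterations": 10,
    "final_reserve_rate": 0.2,
    "learning_rate": {"semeval_2010_task8": 1e-5, "mtl_ce": 5e-5},
    "optimizer": "Adam",
    "bilstm_hidden_size": 256,
    "dropout": 0.5,
    "batch_size": 32,
    "training_epochs": 50,
}
```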
4.5. Multi-Task Learning Performance (RQ1)
To answer RQ1, we first conducted experiments to explore the performance of the multi-task learning framework on the different tasks. The results are presented in Table 4 and Table 5. In this experiment, we conducted 10 prunings to eliminate redundant ‘lstm’ and ‘attention’ parameters for MPC−CE. Each pruning generates a subnet, and each of the three tasks selects the subnetwork with the best performance on the test set as its subnetwork for multi-task training.
Overall, the proposed MPC−CE model achieved the best performance compared with the other multi-task-based methods for causal relation extraction (CRE). It obtained the best F1 scores on both the SemEval-2010 Task 8 and MTL-CE datasets. These results demonstrate the efficacy of MPC−CE in boosting the performance of the CRE task. In addition, MPC−CE performs on par with the other baselines in POS tagging, while slightly worse than the other baselines in Chunk. We attribute this minor issue to errors in the NLTK-based chunk labels. Comparing with the BERT+BiLSTM+CRF model, we observe that MPC−CE outperforms BERT+BiLSTM+CRF on the CRE task by 1.54% and 0.71% on the SemEval-2010 Task 8 and MTL-CE datasets, respectively. These results demonstrate that the sparse sharing mechanism introduced in our model helps improve CRE performance. By effectively harnessing the benefits of parameter sharing, MPC−CE not only outshines various baseline methods in terms of training speed but also paves the way for streamlined and time-efficient model learning.
To verify multi-task learning against single-task learning, we also conducted single-task experiments with these baselines on CRE. The results are shown in Table 6 and Table 7, and the performance improvements of multi-task learning over single-task learning are shown in Table 4 and Table 5. First, we can observe that BERT+BiLSTM+CRF has the best performance on the two datasets in the case of single-task learning. This observation demonstrates that the BERT+BiLSTM+CRF framework adopted by our model has advantages for the CRE task. Second, we also observed that our proposed model achieves a more significant precision score, which demonstrates that it can effectively reduce the possibility of misjudgment, proving the value of multi-task learning from the perspective of denoising.
We also observed that, under single-task learning, CNN-based models perform much better on the MTL-CE dataset than on the SemEval-2010 Task 8 dataset. On the contrary, BERT-based models performed better on the SemEval-2010 Task 8 dataset. These results may be due to two reasons: (1) The MTL-CE dataset contains cross-domain information; studying cross-domain text semantics can promote the model’s capacity to distinguish various semantics, thus enhancing model performance. Additionally, we noticed that the enhancement in model performance is primarily mirrored in the recall metric, which could be ascribed to the fact that cross-domain information can enhance the model’s fitting ability and further boost its recognition of real causal relations. (2) As a pre-trained language model, BERT exhibits a superior capacity to learn text semantics compared to the CNN approach, resulting in stronger performance on both datasets for BERT-based models. Nevertheless, the two variants of BERT-based models in the table demonstrate better efficacy on the SemEval-2010 Task 8 dataset, a phenomenon that may be attributed to the cross-domain information present in the MTL-CE dataset, which induces a disparity in causal distribution between the training and testing sets. When deploying the intricate BERT model, the potential for overfitting emerges, wherein certain noisy data are misidentified as causal relations, consequently decreasing the model’s performance on the CRE task.
Although our method did not achieve the best performance on the POS and Chunk tasks, this is because our model framework is designed for CRE. For the CRE task itself, our model achieves the best performance, which proves that our proposed multi-task framework is effective for CRE.
4.6. Task Analysis (RQ2)
To answer RQ2, we conducted an ablation study to investigate the effect of the POS and Chunk tasks on CRE. The results are shown in Table 8 and Table 9. It can be observed that the combined learning of two tasks resulted in a slight decrease in performance compared with the combination of three tasks, indicating the need to adopt a joint learning strategy for all three tasks. Additionally, we discovered that the performance degradation of BERT-based models was more evident, indicating that the BERT model is more likely to benefit from multi-task learning for capturing semantic information. This is probably because the multi-task learning technique we adopted leverages sparse sharing to prune network parameters in the attention mechanism, which has a notable effect on the BERT model. The results also show that the CRE task achieves better performance when trained together with the POS task than with the Chunk task, highlighting that the POS task plays a more crucial role in improving CRE performance. This observation could be attributed to the fact that the POS task mainly focuses on discerning the part of speech of a specific word, which provides essential support for the CRE task. By classifying the parts of speech of words, noise data can be dealt with at the word level for the CRE task. For example, since some parts of speech cannot act as causal entities, the joint learning of the POS and CRE tasks can help the sequence tagging model correctly determine the parts of speech of causal entities, thus maximizing the utility of part-of-speech information to reduce the impact of unrelated words. In addition, we found that the Chunk task builds on the part-of-speech tagging task and is more elaborate than the POS task. Therefore, CRE and Chunk are two relatively sophisticated tasks, and joint learning of these two tasks increases the challenge of learning. Also, leveraging part-of-speech tagging outcomes as input to the chunk analysis task might introduce some incorrect part-of-speech tagging results, consequently producing noisy data and restraining performance optimization during the joint training of the Chunk and CRE tasks. Furthermore, we observe that the model exhibits relatively robust performance on the POS task under different task combinations. This reliability can be ascribed to the elementary nature of the POS task, which is less prone to be influenced by noise data and complex semantic information, leading to stable overall performance. The model’s performance on the Chunk task was also found to be more consistent compared to the CRE task. This might be because the CRE task demands a deeper understanding of textual semantics and is thus more exposed to the effects of noise data. These findings also suggest that multi-task learning is a better fit for the CRE task.
4.7. Ablation Performance (RQ3)
In this study, we introduced sparse sharing into multi-task learning: we initially conducted parameter pruning for each task and estimated the model performance, and then selected the subnet of the sequence tagging model that offers the best performance for multi-task learning. To evaluate the efficacy of the sparse sharing strategy for CRE performance, we compare the model performance before and after pruning during multi-task learning; the results are shown in Table 10. These results reveal that the joint learning of the three tasks substantially enhances the model’s overall performance. Notably, the model exhibits a more substantial performance enhancement on the CRE task in comparison to the other two tasks, which demonstrates that the sparse sharing mechanism is more beneficial to the CRE task.
In addition, before the implementation of multi-task learning, the pruning rate that yields the best performance differs across the three tasks. It is worth highlighting that the pruning rates for the POS task on the two datasets are larger than those for the Chunk task. This phenomenon could potentially be explained by the POS task being more easily affected by excessive parameters: because the task is simple, excess parameters cause the model to overfit. Our investigation also reveals that, on the SemEval-2010 Task 8 dataset, the pruning ratios associated with both the POS and Chunk tasks are higher than their respective rates on the MTL-CE dataset. These results could potentially be ascribed to the incorporation of cross-domain information within the MTL-CE dataset, which is characterized by the concurrent presence of words and sentences exhibiting notable semantic differences. Consequently, the model exhibits a diminished propensity for overfitting on these two tasks when utilizing the MTL-CE dataset, thereby leading to a reduced pruning ratio. In contrast, the CRE task behaves relatively similarly across the two datasets in this respect, which is probably attributed to the necessity for the model to capture more profound contextual semantics in causal extraction and its greater susceptibility to the noise data problem. Consequently, even on the MTL-CE dataset, which encompasses cross-domain information, the incidence of overfitting remains notably prevalent.
Our findings also suggest that, under single-task learning, the model with pruning performs worse than the model without pruning. This observation could be explained by the fact that, in the single-task case, the model deploys a more substantial number of parameters to comprehensively capture text semantics, and thus contains more useful parameters than the pruned model. However, pruning is essential for the CRE task in the multi-task learning scenario: if each task’s redundant parameters are not pruned, they cause mutual interference among the different tasks’ parameters. These results indicate that both pruning and multi-task learning can effectively improve the performance of the model on the causal relation extraction task. Moreover, inspection of the individual metrics shows that the model’s improvement in precision is more significant after multi-task learning over the three tasks, implying that multi-task learning can capably eliminate noisy data and reduce the model’s misjudgment of positive samples during causal extraction. However, multi-task learning may simultaneously remove some useful information, leading to the misidentification of certain positive samples as negative, so the advancement in recall is not as remarkable as that in precision.
4.8. Parameter Analysis (RQ4)
To answer RQ4, we conducted experiments to evaluate the effectiveness of our model with different types of parameters, including the number of iterative prunings and epochs.
Effect of the number of iterative prunings. Before implementing multi-task learning, we generate several subnets for each task using parameter pruning, followed by the selection of an optimal number of iterative prunings for multi-task training. To understand the effect of the number of iterative prunings on model efficacy for the different tasks, we explored the relation between the number of iterative prunings and the model’s performance on the two datasets; the results are shown in Figure 2. From the results, we can observe the following: (1) On the SemEval-2010 Task 8 dataset, our model achieved the best performance for the POS, Chunk, and CRE tasks when the numbers of iterative prunings were set to 1, 3, and 4, respectively. (2) On the MTL-CE dataset, our model achieved the best performance for the POS, Chunk, and CRE tasks when the numbers of iterative prunings were the same as before. The experimental findings indicate that an additional number of iterations might be needed for extensive cross-domain information, thus enhancing the learning of the parameters that require pruning.
Effect of the number of epochs. We conducted experiments to analyze the convergence of the three distinct tasks. Figure 3 illustrates how the performance trends change with increasing epochs. Firstly, the observations reveal that both the Chunk and CRE tasks converge at approximately 20 epochs on the SemEval-2010 Task 8 dataset, whereas they converge at around 30 epochs on the MTL-CE dataset. In contrast, the POS task converges in less than 10 epochs on the SemEval-2010 Task 8 dataset, and at about 10 epochs on the MTL-CE dataset. These findings indicate that all three tasks converge more rapidly on the SemEval-2010 Task 8 dataset than on the MTL-CE dataset. This difference in convergence rates is probably attributable to the MTL-CE dataset containing more cross-domain information, making its text semantics more complex; thus, the model needs additional epochs to capture the semantic information. Secondly, the convergence trends of the Chunk and CRE tasks exhibit greater variability compared to the POS task. This is possibly because the Chunk and CRE tasks place more stringent demands on semantic analysis, making them more sensitive to the interference induced by noisy data.
Analysis of the NLTK. MPC−CE performs on par with the other baselines in POS tagging, while slightly worse than the other baselines in Chunk. We therefore employed the complete NLTK pipeline for the CE task and compared it with the other baselines; the results are displayed in Table 11. Given the results of the NLTK and the other baselines, we attribute this minor issue to errors in the NLTK-based chunk labels. Note that MPC−CE achieves a lower performance than the NLTK by only a limited margin, which illustrates that MPC−CE could perform better if provided with gold labels.