Article

Fd-CasBGRel: A Joint Entity–Relationship Extraction Model for Aquatic Disease Domains

1 College of Mathematics and Computer Science, Zhejiang A&F University, 666 Wusu Street, Hangzhou 311300, China
2 Agricultural Equipment Research Institute, Zhejiang Academy of Agricultural Sciences, 298 Desheng Middle Road, Hangzhou 310021, China
3 Key Laboratory of Agricultural Equipment in Southeast Hilly and Mountainous Areas of the Ministry of Agriculture and Rural Affairs (Ministry-Province Joint Construction), 298 Desheng Middle Road, Hangzhou 310021, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(14), 6147; https://doi.org/10.3390/app14146147
Submission received: 3 May 2024 / Revised: 16 June 2024 / Accepted: 27 June 2024 / Published: 15 July 2024

Featured Application

The model is primarily utilized for the task of entity relationship extraction during the construction process of an aquatic disease knowledge graph.

Abstract

Entity–relationship extraction plays a pivotal role in the construction of domain knowledge graphs. For the aquatic disease domain, however, this relationship extraction is a formidable task because of overlapping relationships, data specialization, limited feature fusion, and imbalanced data samples, which significantly weaken the extraction’s performance. To tackle these challenges, this study leverages published books and aquatic disease websites as data sources to compile a text corpus, establish datasets, and then propose the Fd-CasBGRel model specifically tailored to the aquatic disease domain. The model uses the Casrel cascading binary tagging framework to address relationship overlap; utilizes task fine-tuning for better performance on aquatic disease data; trains on specialized aquatic disease corpora to improve adaptability; and integrates the BRC feature fusion module (which incorporates self-attention mechanisms, BiLSTM, relative position encoding, and conditional layer normalization) to leverage entity position and context for enhanced fusion. Further, it replaces the traditional cross-entropy loss function with the GHM loss function to mitigate category imbalance issues. The experimental results indicate that the F1 score of the Fd-CasBGRel on the aquatic disease dataset reached 84.71%, significantly outperforming several benchmark models. This model effectively addresses the challenges of low triplet-extraction performance caused by high data specialization, insufficient feature integration, and data imbalances. The model achieved the highest F1 score of 86.52% on the overlapping relationship category dataset, demonstrating its robust capability in extracting overlapping data. Furthermore, we also conducted comparative experiments on the publicly available WebNLG dataset, where the proposed model obtained the best performance metrics among the compared models, indicating good generalization ability.

1. Introduction

China leads the world in aquaculture and had a total output of 68.65 million tons of aquatic products in 2022 alone [1]. Of this, 55.65 million tons are directly attributed to aquaculture production, whose ratio to fishing production is at least 81:19. This substantial contribution from aquaculture not only ensures national food security and the supply of vital agricultural products but also plays a pivotal role in boosting farmers’ income [2,3]. However, factors such as high-density stocking and water pollution have led to a high incidence of disease during aquaculture. Accordingly, the diagnosis and prevention of aquatic diseases have emerged as critical bottlenecks hindering the industry’s rapid and sustainable development. It is imperative to leverage modern information technology to empower traditional aquaculture practices and hasten the transition towards digitalization and intelligence in disease diagnosis and prevention techniques.
Relationship extraction is a fundamental step when constructing a knowledge graph [4], which aims to identify entity–relationship ternaries (HEAD, RELATIONSHIP, TAIL) from unstructured text. In the aquatic disease domain, that relationship extraction primarily involves extracting various entities, such as established diseases, control methods, drugs, and pathogens, from textual data related to aquatic diseases, as well as elucidating the relationships among them. These extracted data are then organized into a knowledge graph, which can be readily used in various downstream applications including intelligent Q&A systems, early disease warning systems, and knowledge visualization tools. By integrating these technologies into the aquatic disease control process, digitalization can be enhanced, thereby improving overall disease management efforts.
For the entity–relationship extraction task within the aquatic disease domain, its extensive overlap in relationships poses a significant issue [5] that impacts the effectiveness of ternary group extraction. Further, this domain’s data exhibit notable specialization and imbalance, being characterized by the prevalence of many specialized terms and an imbalance between data categories and classification difficulty levels. Consequently, generalized pretraining language models encounter challenges in extracting semantic features and have poor domain adaptability, thereby adversely affecting extraction performance. In addition, in the process of entity–relationship extraction, many supplementary features beyond the encoded features of sentences and entities remain underutilized, leading to feature loss. To address these challenges, this paper proposes a cascading binary labeling framework based on a fine-tuned pretraining model, a feature fusion module, and a GHM loss function. We show that this framework can effectively resolve the aforementioned issues and bolsters model performance on small-scale, unbalanced aquatic disease-relationship extraction tasks.

2. Related Work

Entity–relationship extraction is instrumental in the construction of knowledge graphs. The traditional pipeline method for doing that divides the entity and relationship extraction task into separate subtasks, initially extracting entity pairs and later categorizing them based on relationship labels. The main drawback of this approach is that it ignores the potential correlation between these subtasks, which leads to error propagation [6]. In contrast, joint entity–relationship extraction integrates entity recognition and relationship recognition, effectively solving the error propagation problem. Early joint entity–relationship extraction methods include feature engineering methods [7], tree-structured joint models [8], and joint extraction methods based on sequence annotation [9,10,11]. These methods have drawbacks such as poor portability and complex data labeling and feature construction. Further, the key problem of relationship overlap, particularly prevalent in specialized domains like aquatic sciences, is hard to solve using the aforementioned approaches.
In this context, Yu et al. [12], Zhuang et al. [13], and Zeng et al. [14] were able to successfully recognize overlapping relation triplets by augmenting the labeling strategy. Yet these approaches still treat relations as discrete labels of entities, which inevitably limits their performance. Wei et al. [15] devised a pointer network to comprehensively model triplets by learning the mapping function linking relations to entities. To address the Casrel model’s low computational efficiency due to relationship redundancy, Zheng et al. [16] introduced the PRGC model, which breaks down the task into three steps: relationship judgment, entity extraction, and entity alignment. This approach effectively enhances computational efficiency. Models for the extraction of overlapping relationships based on copying mechanisms rely on various feature extraction networks to capture semantic features of target entity fragments after localization and then directly copy them to the Decoder for extraction. Representative models for that approach include CopyR [9] and DPointer [17], but they have their own drawbacks, such as complex mechanisms and entity relationship mismatches. The table-filling methods maintain a table for each relation, with each item indicating whether the tagged pair displays that particular relation or not, e.g., TPLinker [18], PFN [19], and UniRel [20]. These models are effective in overcoming the problem of resolving overlapping relationships, but they tend to have high model complexity. In recent years, the emergence of Graph Convolutional Neural Networks (GCNNs) has been monumental since their graph data structure effectively conveys various NLP task features. GCNNs have been applied to the study of overlapping relationship extraction, whereby syntactic dependency trees are converted into adjacency matrices and inputted into the graph neural network for relationship extraction. Noteworthy examples include research by Wang et al. [21], Fu et al. [22], and Duan et al. [23]. Relational extraction models based on pretrained language models mainly utilize generative pretrained language models to extract features. Through data fine-tuning, they are capable of achieving adequate performance in downstream tasks. Representative models include REBEL [24], CGT [25], UIE [26], and SPN [27]. Nevertheless, such pretrained models evidently have non-trivial shortcomings, such as difficulty in handling long text, as well as complex sentence structures, along with high computational and data requirements.
In the field of fisheries, Yang [28] proposed a BERT-BiLSTM-CRF model that incorporates dual attention mechanisms for words and sentences. This model was designed to enhance the effectiveness of relationship extraction by handling key issues, such as semantic loss in lengthy sequences and the irrational weight allocation of vectors, within the fisheries domain. Additionally, Liu et al. [29] developed the CaBiLSTM model, employing a layered approach to improve the recognition accuracy of outer layer entities. This approach involves dimensionality reduction of inner entity features to mitigate nested entity obstacles in Aquatic Disease Named Entity Recognition (NER) and integrates BERT for enhanced performance. Jiang [30] introduced a BIO-based entity relationship joint annotation strategy, which, in combination with the BERT-BiLSTM-CRF model, strengthens the relationship extraction performance. Bi et al. [31] utilized BERT-BiLSTM to carry out feature extraction from lengthy texts in aquaculture. They applied the N-Gram algorithm to segment features for integration into a cascading BiLSTM model, in this way resolving the issues of information loss and misclassification in long texts to bolster the performance of entity-relation joint extraction.
Previous research in the aquatic domain has mainly focused on extracting information from lengthy texts, identifying nested entities, and conducting the joint extraction of entity relationships. However, the persistent problem of relationship overlap within the aquatic disease domain remains largely unaddressed. The prevalence of abundant overlapping triplets often leads to recognition errors and incomplete extractions, substantially diminishing the effectiveness of relationship extraction. Moreover, the specialized nature of the aquatic disease corpus itself poses stark challenges, namely, the presence of domain-specific terms like disease and a plethora of drug names. This complexity makes it very hard for generic preprocessing models to accurately capture semantic features, resulting in poor domain suitability. Compounding that difficulty, the imbalance in aquatic disease data impedes effective model learning. Lastly, during the extraction process, many features beyond the encoded vectors of sentences will go underutilized, leading to both feature loss and an insufficiently nuanced semantic expression, which reduces model performance. Existing research methods have fallen short of soundly addressing these outstanding challenges. That said, the cascade tagging framework of the Casrel model does show promise in effectively identifying overlapping relationship triads, offering a potential solution to the relationship overlap problem in aquatic disease data. Therefore, this paper adopts the cascading binary labeling framework as its foundational structure. Expanding on that, it integrates a fine-tuned pretraining model, a feature fusion module, and the GHM loss function to construct an entity–relationship joint extraction model tailored to the aquatic disease domain. The main contributions of this approach are summarized as follows:
(1)
A textual corpus pertaining to aquatic diseases was gathered, from which a dataset was compiled for the aquatic disease relationship extraction. This dataset entails 33 relationship categories and comprises a total of 10,068 data entries.
(2)
The cascading binary labeling framework was employed to tackle the issue of relationship overlap, while the Roberta-wwm-ext pretrained model was adopted to replace the Bert pretrained model. Next, the model was fine-tuned with domain-specific knowledge related to aquatic diseases. This process of fine-tuning sought to enhance the model’s adaptation to the task of extracting relationships among entities related to aquatic diseases, particularly in scenarios with limited data samples, thus improving its domain suitability.
(3)
To address the issue of data imbalance, the GHM loss function has been incorporated into the model.
(4)
To further enhance the degree of feature fusion and enrich the semantic features of the model inputs, this paper introduces a feature fusion module named BRC into the infrastructure. The comprehensive structure combines a self-attention mechanism, a BiLSTM network, a multi-head attention mechanism with relative position encoding, and a conditional layer normalization layer. This module integrates the extracted sentence context features, head entity features, and head entity position features through the conditional layer normalization module, thereby enriching the semantic representation of the overall input and improving the effectiveness of the model’s extraction capabilities.

3. Aquatic Disease Dataset Construction

Currently, there is a dearth of openly available datasets for the entity–relationship extraction task within the aquatic disease domain. Hence, we relied on paper books and websites as primary data sources to establish a corpus database. Building upon this foundation, we designed the data structure and used Colabeler (v2.0.4) annotation software to perform the manual data annotation. This culminated in the creation of ‘AquaticDiseaseRE’, a dataset specifically tailored for aquatic disease research.

3.1. Data Collection

This paper gathers textual data from aquatic disease books and supplements it with data obtained by crawling professional websites. A total of 10,068 text entries were collected, containing approximately 450,000 characters in total. The primary data sources are outlined in Table 1:

3.2. Pattern Layer Design

The hierarchical structure of aquatic disease data was crafted by a thorough examination of the literature, for instance, Aquatic Animal Pathology, and in consultation with experts on aquatic diseases, as depicted in Figure 1 (Details of specific types and their descriptions can be found in Appendix A.1).

3.3. Dataset Segmentation

In this paper, the dataset was partitioned into a training set, validation set, and test set in an 8:1:1 ratio. The training set consisted of 8208 data samples, while both the validation and test sets comprised 1026 data samples each. Detailed statistics of the relationship categories in the dataset, along with some examples of their data samples, are presented in Figure 2 (details of specific types and their descriptions can be found in Appendix A.2.) and also in Table 2.
The dataset developed in this paper generally faces the issue of Single Entity Overlap (SEO), where a single head entity is simultaneously linked to multiple distinct relationships. So, we classified the AquaticDiseaseRE dataset into Normal No Overlap (Normal) and Single Entity Overlap (SEO) categories based on the type of relationship overlap, as shown in Table 3:
Additionally, in this paper, the assembled AquaticDiseaseRE dataset was analyzed for sentence length and the number of triplets, as shown in Figure 3 and Table 4.

4. Joint Extraction Model for Aquatic Disease Entity Relationships

4.1. Current Issues

Currently, within the aquatic disease domain, the chief challenges to its relationship extraction include relationship overlap, a high degree of corpus specialization, insufficient feature integration, and data imbalance.
(1)
Overlapping relationships: This concept, also known as ternary overlap, is split into two key categories: Single Entity Overlap (SEO) and Entity Pair Overlap (EPO). Within the assembled AquaticDiseaseRE dataset—designed for extracting relationships in the context of aquatic diseases—the relationships mainly follow logical knowledge structures, so the consideration of EPO cases is unnecessary. The phenomenon of SEO is more common, being chiefly characterized by linkages between aquatic diseases, pathogens, and their associated subclass concepts. As Figure 4 shows, the term ‘Normal’ designates those instances featuring standard, non-overlapping relationships.
(2)
The text corpus is highly specialized: Evidently, this corpus for the aquatic disease domain exhibits a high degree of specialization, being replete with an extensive array of proper nouns. For instance, the sentence “Infectious pancreatic necrosis disease causes the diseased fish’s body color to turn black, eyeballs to protrude, and congestion and hemorrhage at the base of the ventral fins, which can be prevented and controlled with povidone-iodine and erythromycin” illustrates this point. Certain terms such as “infectious pancreatic necrosis disease”, “abdominal fins”, “povidone-iodine”, and “erythromycin” are quite specific vocabulary within the aquatic domain of research, encompassing disease names, affected body parts, and both preventative and treatment interventions. Traditional BERT-series pretrained language models—developed from general corpus data—struggle to accurately encode the semantics of these specialized terms. This handicap greatly impairs the performance of subsequent entity–relationship extraction tasks.
(3)
Limited integration of features: In this paper, we explore the dependency of the Casrel model’s extraction performance on the semantic richness of input features. The original model merely employs a simplistic aggregation of BERT-encoded sentence vectors and entity representation vectors for input into the relation–tail entity recognition module. This approach proves clearly inadequate as it overlooks several crucial features, including sentence context and entity location attributes. Additionally, this rudimentary method of feature fusion can result in the loss of features, confusion of information, and a lack of adaptability in adjusting fusion strategies according to specific tasks.
(4)
The data is imbalanced: An imbalance in data mainly occurs in two ways: category imbalance and sample difficulty imbalance, with the task of aquatic disease relationship-extraction saddled with both challenges. The category imbalance is evident from the diverse relationship categories within the data and the significant variation in the sample sizes across different relationship categories, as Figure 2 shows. The Casrel model operates on a cascading binary labeling framework, utilizing this architecture for head entity recognition by determining whether each character in a sentence marks an entity boundary. This process of head entity recognition, alongside relationship and tail entity recognition, relies on a similar structural approach, where a ‘1’ label signifies a character as an entity boundary, in contrast to a ‘0’ label assigned to characters lying outside those boundaries. Typically, the ‘0’ label samples greatly outnumber the ‘1’ label samples and are simpler to classify, highlighting the imbalance between these difficult and easy types of samples.

4.2. Model Architecture

We propose an approach for entity–relationship joint extraction that leverages the cascading binary labeling framework. This framework integrates a series of advanced components, namely, a finely tuned Roberta-wwm-ext [32] pretrained model, the Gradient Harmonizing Mechanism (GHM) for loss adjustment [33], the self-attention Mechanism [34], and a Bidirectional Long Short-Term Memory (BiLSTM) network [5]. It also employs a Multihead Attention Mechanism based on relative position encoding (RPE) [35] in addition to conditional layer normalization (CLN) techniques [36]. These elements are synergistically combined in this study to create the Fd-CasBGRel model; it is specifically designed for extracting entity relationships in the aquatic disease domain. This model adeptly addresses the challenges discussed in Section 4.1, enabling better extraction efficiency and accuracy. The model’s architecture is divided into two core sections: the ‘Encoder’ and ‘Decoder’. The Encoder processes and encodes sentences, feeding them into the Decoder, which in turn is tasked with detecting relationship triplets within sentences. The Decoder is built around a head entity tagger and a relationship–tail entity tagger; its detailed architecture is illustrated in Figure 5.

4.2.1. FD–Roberta-wwm-ext Encoder

The Encoder serves as the model’s input layer, primarily tasked with extracting textual features and converting sentences into fixed-dimensional feature-encoding vectors. In this work, we replace the Bert model with the Roberta-wwm-ext pretrained model, which is then fine-tuned for our specific task by using knowledge of aquatic diseases. Unlike the original Casrel model’s Encoder, which relies on the Bert pretraining model (trained on an English corpus and thus less effective at encoding Chinese text), we employ the Chinese Roberta-wwm-ext model. This substitution improves our proposed model’s coarse-grained semantic modeling capabilities.
However, both the Bert and Roberta-wwm-ext models are trained on general public corpora, whereas the field of aquatic diseases encompasses a rich array of specialized terminology; because the Roberta-wwm-ext model lacks domain-specific adaptation, its encoding of specialized corpora is suboptimal. Gururangan et al. [37] suggested that in cases of sparse samples or domain-specific applications, fine-tuning pretrained language models can significantly enhance performance. Candidate fine-tuning strategies for these models include In-Domain Further Pretraining and Within-Task Further Pretraining. Here, we implement a hybrid strategy that combines these two approaches. We use the aquatic disease dataset developed in this study, augmented with additional aquaculture knowledge gathered via Scrapy, to fine-tune the Roberta-wwm-ext model. This fine-tuning used a batch size of 8 and 5 epochs. The resulting model was named FD–Roberta-wwm-ext.
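For illustration, the following is a minimal sketch of how this kind of in-domain further pretraining could be run with the HuggingFace Transformers library, assuming the public hfl/chinese-roberta-wwm-ext checkpoint and a hypothetical one-sentence-per-line corpus file; it is not the exact training script used in this study.

```python
# Sketch of in-domain further pretraining (masked language modeling) on an
# aquatic-disease corpus. Paths and the corpus file are hypothetical;
# hyper-parameters follow the text (batch size 8, 5 epochs). Standard MLM
# masking is used here; whole-word masking would need an extra segmentation step.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "hfl/chinese-roberta-wwm-ext"            # public Roberta-wwm-ext checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One sentence per line: aquatic-disease dataset plus crawled aquaculture text.
corpus = load_dataset("text", data_files={"train": "aquatic_disease_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="fd-roberta-wwm-ext",   # fine-tuned encoder used by the Fd-CasBGRel model
    per_device_train_batch_size=8,
    num_train_epochs=5,
    save_strategy="epoch")

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```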
The input to the encoding layer is an input vector e, which is obtained by summing the token embedding vector ew and the position embedding vector ep. This vector is then processed through N-layers of the ‘Transformer’ to yield the sentence’s final vector representation, hN. The overall structure is depicted in Figure 6, while the computational process is detailed in Equations (1) and (2):
h^{1} = \mathrm{Trans}(e) \quad (1)
h^{N} = \mathrm{Trans}(h^{N-1}) \quad (2)
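For concreteness, a brief sketch of how the fine-tuned encoder produces the token-level representation hN for a sentence; the checkpoint path and the example sentence are illustrative assumptions.

```python
# Sketch: encoding a sentence into h^N with the fine-tuned FD-Roberta-wwm-ext encoder.
# The checkpoint path is the hypothetical output of the fine-tuning step above.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fd-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("fd-roberta-wwm-ext")

sentence = "传染性胰脏坏死病可用聚维酮碘防治"   # "IPN disease can be controlled with povidone-iodine"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    # e = token embeddings + position embeddings, then N Transformer layers -> h^N
    h_N = encoder(**inputs).last_hidden_state   # shape: (1, seq_len, hidden_size)
```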

4.2.2. SAB Feature Enhancement Layer

To enhance the semantic representation of input sentence vectors and to extract additional features, this study introduces a feature enhancement module named SAB. It integrates a self-attention mechanism and a BiLSTM (Bidirectional Long Short-Term Memory) network in a sequential manner that follows the FD–Roberta-wwm-ext pretrained model. First, the sentence vectors hN are processed by a self-attention network to capture long-range dependencies and local key features of the sentence. The result is then input into the BiLSTM network to further extract contextual features of the sentence.
Initially, a self-attention mechanism is incorporated to process the sentence vectors encoded by Bert, enhancing the ability to capture distant sentence dependencies and emphasize local key features. Subsequently, these vectors, now processed through the self-attention mechanism, are fed into the BiLSTM network to extract contextual features of the sentences. The integration of the self-attention mechanism not only augments the synergy between Bert and BiLSTM but also improves the model’s capability to capture contextual nuances and strengthens the semantic representation of the input sentences.
The calculation process went as follows (Equation (3)):
h_{sa} = \mathrm{SelfAttention}(h^{N}) \quad (3)
where hN represents the input sentence vector, and hsa denotes the feature vector obtained after encoding by the self-attention mechanism.
BiLSTM, a Bidirectional Long Short-Term Memory network, excels at extracting bidirectional contextual features from lengthy sequences and at capturing semantic dependencies in either direction. Hence, to further distill the semantic features of the input sentence, we use the BiLSTM network to encode the feature vector processed by the self-attention network. This step provides a contextual representation of the sentence, as depicted in Equation (4):
b_{N} = \mathrm{BiLSTM}(h_{sa}) \quad (4)
where bN refers to the sentence vector encoded by the BiLSTM network.
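A minimal PyTorch sketch of the SAB layer described above (self-attention over hN followed by a BiLSTM); the hidden sizes and the number of attention heads are assumptions, not the paper’s exact hyper-parameters.

```python
import torch
import torch.nn as nn

class SAB(nn.Module):
    """Feature enhancement layer: self-attention over h^N, then a BiLSTM."""
    def __init__(self, hidden_size=768, num_heads=8, lstm_hidden=384):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.bilstm = nn.LSTM(hidden_size, lstm_hidden, batch_first=True,
                              bidirectional=True)   # output dim = 2 * lstm_hidden

    def forward(self, h_n):                        # h_n: (batch, seq_len, hidden_size)
        h_sa, _ = self.self_attn(h_n, h_n, h_n)    # Eq. (3): long-range / local key features
        b_n, _ = self.bilstm(h_sa)                 # Eq. (4): bidirectional context features
        return b_n                                 # (batch, seq_len, 2 * lstm_hidden)
```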

4.2.3. Header Entity Tagger

The head entity tagger primarily serves to identify the head entity within an input sentence. It utilizes the feature vector bN, produced by the SAB module, as its input. The output is a two-layer, One-Hot Encoding (OHC) vector, which clearly distinguishes the start and end positions of the head entity. Specifically, bN first traverses two linear layers, each with an output dimension of 1, and later undergoes a sigmoid activation function. Binary classification is then performed based on a predefined probability threshold, whereby the vectors at the start or end positions of the head entity are assigned a value of 1, while all others are assigned a value of 0. This process is detailed in Equations (5) and (6):
p_{i}^{start\_sb} = \mathrm{sigmoid}(W_{start\_sb} x_{i} + b_{start\_sb}) \quad (5)
p_{i}^{end\_sb} = \mathrm{sigmoid}(W_{end\_sb} x_{i} + b_{end\_sb}) \quad (6)
where x_i is the encoding vector of the i-th character within the feature vector b_N produced by the SAB module; W and b denote the learnable weights and biases in the linear layer, respectively, with ‘sigmoid’ denoting the activation function. The corresponding maximum likelihood function for the head entity within a given sentence is defined as follows:
p_{\theta}(s \mid x) = \prod_{t \in \{start\_sb,\, end\_sb\}} \prod_{i=1}^{L} \left( p_{i}^{t} \right)^{\mathbb{I}\{y_{i}^{t} = 1\}} \left( 1 - p_{i}^{t} \right)^{\mathbb{I}\{y_{i}^{t} = 0\}} \quad (7)
where L is the length of the input sentence, and y_i^t represents the positional classification labels for characters at the start/end positions of the head entity, taking values in the set {0, 1}. The parameters that need to be optimized are denoted by θ = {W_start_sb, b_start_sb, W_end_sb, b_end_sb}.
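The head entity tagger can be sketched as two independent linear layers with sigmoid activations whose outputs are thresholded into start/end pointers; the 0.5 threshold and hidden size below are assumptions.

```python
import torch
import torch.nn as nn

class HeadEntityTagger(nn.Module):
    """Tags the start and end positions of head entities (Eqs. (5)-(6))."""
    def __init__(self, hidden_size=768, threshold=0.5):
        super().__init__()
        self.start_fc = nn.Linear(hidden_size, 1)
        self.end_fc = nn.Linear(hidden_size, 1)
        self.threshold = threshold

    def forward(self, b_n):                       # b_n: (batch, seq_len, hidden_size)
        p_start = torch.sigmoid(self.start_fc(b_n)).squeeze(-1)
        p_end = torch.sigmoid(self.end_fc(b_n)).squeeze(-1)
        # Binary tags: 1 marks a character as a head-entity start/end boundary.
        return (p_start > self.threshold).long(), (p_end > self.threshold).long()
```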

4.2.4. BRC Feature Fusion Module

The original Casrel model has a critical deficiency in how it facilitates information interaction between the head entity tagger and the relation–tail entity tagger. In merely summing the vectors of the head entity and the sentence as subsequent inputs, crucial features are lost. The literature [38] introduced the PATB relation extraction model, which improves performance by integrating the head entity’s head and tail character vectors as entity location information for feature fusion. However, relying solely on these vectors for entity location features is inadequate because it neglects the relative positional features among characters and is therefore less likely to capture sequential associations accurately. In response, this paper introduces a feature fusion module named BRC, which leverages head entity-relative position encoding along with sentence-local and contextual features. This approach aims not only to strengthen the connection between the head entity tagger and the relation–tail entity tagger but also to maximize the utility of extra features for bolstering the entity–relationship extraction. The BRC module incorporates the previously developed SAB feature enhancement layer, a multi-head attention network with relative position encoding (RPE), and conditional layer normalization (CLN), thus presenting a comprehensive framework for improving the relation extraction performance.
The general workflow went as follows. Begin by extracting the sentence enhancement feature vector bN encoded by the SAB module; then, calculate the relative position encoding for the first and last characters of the head entity; finally, input the sentence-enhancement feature vector, head entity vector, and head entity-relative position vector into the CLN for feature fusion. This fused feature vector then replaces the original Casrel model’s simplistic summing of the head entity and sentence vectors to serve as the new input to the relation–tail entity tagger.
Multihead Attention Networks with Relative Position Computation
After obtaining the first and last word vectors, vs and ve, of the head entity, we can calculate their respective relative position representations for each character within the sentence vector bn = [v1, v2, …, vn]. These word vectors are then weighted accordingly to serve as the ‘Query’, while the sentence vectors are weighted to act as both ‘Key’ and ‘Value’. After doing so, they are fed into the multi-head attention network as parameters to derive the head entity’s relative position-encoding vectors. Taking the computation of the relative position encoding for the head entity’s first character vector, vs, as an example, the process is detailed in Equations (8)–(14).
\mathrm{clip}(v, k) = \max(-k, \min(k, v)) \quad (8)
a_{sj}^{K} = w^{K}_{\mathrm{clip}(j-s,\,k)} \quad (9)
a_{sj}^{V} = w^{V}_{\mathrm{clip}(j-s,\,k)} \quad (10)
z_{i}^{s} = \sum_{j=1}^{n} a_{isj} \left( v_{j} W_{i}^{V} + a_{sj}^{V} \right) \quad (11)
a_{isj} = \frac{\exp(e_{isj})}{\sum_{k=1}^{n} \exp(e_{isk})} \quad (12)
e_{isj} = \frac{v_{s} W_{i}^{Q} \left( v_{j} W_{i}^{K} + a_{sj}^{K} \right)^{T}}{\sqrt{d_{z}}} \quad (13)
rpe_{s} = \mathrm{Concat}(z_{1}^{s}, z_{2}^{s}, \ldots, z_{h}^{s}) W^{o} \quad (14)
Here, a_{sj}^K and a_{sj}^V denote the relative position representations between the first character vector of the head entity, v_s, and the j-th character vector of the sentence vector b_n, where j − s is the relative position between the two characters and k is the maximum relative position-encoding distance. The terms e_{isj}, a_{isj}, and z_i^s denote, respectively, the scaled dot-product score between the Query and Key computed by the i-th head of the multi-head attention mechanism, the normalized attention weight, and the attention-encoding vector. The term rpe_s is the relative position-encoding vector of the first character of the head entity; h is the number of heads in the multi-head attention mechanism, and W^o is the associated output weight matrix. The relative position encoding rpe_e of the last character of the head entity is computed in a manner analogous to rpe_s.
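A simplified PyTorch sketch of Equations (8)–(14), computing the relative position-encoding vector rpes for the head entity’s first character; the hidden size, number of heads, and maximum clipping distance k are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosHeadAttention(nn.Module):
    """Sketch of Eqs. (8)-(14): multi-head attention in which the head-entity
    character vector v_s is the Query, the sentence vectors are Keys/Values,
    and clipped relative positions j - s add learned embeddings a^K, a^V."""
    def __init__(self, hidden_size=768, num_heads=8, max_rel_dist=16):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.h, self.d = num_heads, hidden_size // num_heads
        self.k = max_rel_dist
        self.w_q = nn.Linear(hidden_size, hidden_size)
        self.w_k = nn.Linear(hidden_size, hidden_size)
        self.w_v = nn.Linear(hidden_size, hidden_size)
        self.w_o = nn.Linear(hidden_size, hidden_size)
        # Embedding tables for clipped relative positions in [-k, k], Eqs. (8)-(10).
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, self.d)
        self.rel_v = nn.Embedding(2 * max_rel_dist + 1, self.d)

    def forward(self, b_n, s):                 # b_n: (seq_len, hidden); s: index of v_s
        n = b_n.size(0)
        rel = torch.clamp(torch.arange(n) - s, -self.k, self.k) + self.k   # clip(j-s, k)
        a_k, a_v = self.rel_k(rel), self.rel_v(rel)        # (n, d) each
        q = self.w_q(b_n[s]).view(self.h, self.d)          # one Query per head
        k = self.w_k(b_n).view(n, self.h, self.d)
        v = self.w_v(b_n).view(n, self.h, self.d)
        # Eq. (13): scores against (Key + relative-position term), scaled by sqrt(d).
        scores = torch.einsum('hd,nhd->hn', q, k + a_k.unsqueeze(1)) / self.d ** 0.5
        attn = F.softmax(scores, dim=-1)                            # Eq. (12)
        z = torch.einsum('hn,nhd->hd', attn, v + a_v.unsqueeze(1))  # Eq. (11)
        return self.w_o(z.reshape(-1))         # Eq. (14): rpe_s, shape (hidden,)
```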
Conditional Layer Normalization (CLN) Layer
The CLN’s architecture is inspired by conditional batch normalization (CBN), which is a structure that incorporates conditional multi-information fusion in the field of image processing. We present a multi-conditional CLN module for feature fusion whose main structure is illustrated in Figure 7.
The vector of the head entity, denoted as sub, along with the two relative position-encoding vectors rpes and rpee—respectively for the first and last characters of the head entity—are concurrently fed into the CLN layer. This input is accompanied by the sentence vector bN; it is already refined through the self-attention network and the BiLSTM-encoding layer. The process eventually generates feature fusion vectors that amalgamate sentence context features, sentence localization features, and the positional attributes of the head entity. This integration facilitates the directed control of information output. The involved mechanism is detailed in Equation (15):
\mathrm{CLN}(b_{N}, c_{sub}, c_{rpe_{s}}, c_{rpe_{e}}) = \frac{b_{N} - E[b_{N}]}{\sqrt{\mathrm{Var}[b_{N}] + \varepsilon}} \left( W_{1} c_{sub} + W_{2} c_{rpe_{s}} + W_{3} c_{rpe_{e}} \right) \quad (15)
Here, the ‘c’ prefix denotes the conditional inputs (the head entity vector and the two relative position encodings), while E[b_N] and Var[b_N] are the mean and variance of b_N, respectively. The term ε is a small positive constant introduced to avoid computational errors that may arise from a zero denominator.
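A minimal sketch of the multi-conditional CLN of Equation (15), assuming the three conditions jointly generate a per-dimension gain applied to the normalized sentence vector; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Sketch of Eq. (15): normalize b_N, then scale it with a gain generated
    from the head-entity vector and the two relative-position encodings."""
    def __init__(self, hidden_size=768, eps=1e-12):
        super().__init__()
        self.eps = eps
        self.w1 = nn.Linear(hidden_size, hidden_size, bias=False)  # W1 * c_sub
        self.w2 = nn.Linear(hidden_size, hidden_size, bias=False)  # W2 * c_rpe_s
        self.w3 = nn.Linear(hidden_size, hidden_size, bias=False)  # W3 * c_rpe_e

    def forward(self, b_n, c_sub, c_rpe_s, c_rpe_e):
        # b_n: (batch, seq_len, hidden); conditions: (batch, hidden)
        mean = b_n.mean(dim=-1, keepdim=True)
        var = b_n.var(dim=-1, unbiased=False, keepdim=True)
        normed = (b_n - mean) / torch.sqrt(var + self.eps)
        gain = (self.w1(c_sub) + self.w2(c_rpe_s) + self.w3(c_rpe_e)).unsqueeze(1)
        return normed * gain   # fused feature vector fed to the relation-tail tagger
```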

4.2.5. Relationship–Tail Entity Tagger

The relation–tail entity tagger identifies both relations and tail entities by utilizing a structure identical to that of the head entity tagger, such that their quantity aligns with the number of relation categories. For this, we use the feature fusion vector, encoded by the BRC module, as the input. The encoding process for each character within the input sentence is detailed in Equations (16) and (17):
p_{i}^{start\_ob} = \mathrm{sigmoid}(W_{start\_ob} c_{i} + b_{start\_ob}) \quad (16)
p_{i}^{end\_ob} = \mathrm{sigmoid}(W_{end\_ob} c_{i} + b_{end\_ob}) \quad (17)
where ci represents the vector representation of the i-th character in the sentence, as encoded by the CLN module, and W and b are the weight and bias terms, respectively, of the tail entity labeling module, with the sigmoid function being the activation mechanism.
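A sketch of the relation–tail entity tagger, which applies one start/end pointer pair per relation category so that its outputs have the l × r shape mentioned in Section 4.2.6; the number of relations and the threshold are assumptions.

```python
import torch
import torch.nn as nn

class RelationTailTagger(nn.Module):
    """One start/end binary pointer per relation category (Eqs. (16)-(17))."""
    def __init__(self, hidden_size=768, num_relations=33, threshold=0.5):
        super().__init__()
        self.start_fc = nn.Linear(hidden_size, num_relations)
        self.end_fc = nn.Linear(hidden_size, num_relations)
        self.threshold = threshold

    def forward(self, c):              # c: fused vector from BRC, (batch, seq_len, hidden)
        p_start = torch.sigmoid(self.start_fc(c))   # (batch, seq_len, num_relations)
        p_end = torch.sigmoid(self.end_fc(c))
        # A 1 at position (i, r) marks character i as a tail-entity boundary for relation r.
        return (p_start > self.threshold).long(), (p_end > self.threshold).long()
```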

4.2.6. GHM Loss Function

Since the model developed in this paper employs a cascading binary pointer network for the entity–relationship extraction, this fundamentally constitutes a binary classification task. Therefore, we applied a binary cross-entropy loss function for the loss computation. To identify entity boundaries and relationship–tail entities, two classification matrices of dimensions l × 1 and l × r were constructed, where l is the sentence length and r is the number of relationship categories. In spite of this setup, the dataset harbors considerable diversity in the relationship categories, leading to a pronounced disparity in sample distribution across them. This discrepancy manifests as a category imbalance during both the head entity labeling and relationship–tail entity labeling phases. Specifically, non-positional labels ‘0’ outnumber the positional labels ‘1’, and negative samples are mostly labeled ‘0’; they constitute a hefty proportion of the dataset. This predominance of easily classifiable negative samples and the resultant label imbalance could potentially impair the effectiveness of model training.
To deal with that category imbalance issue, the Focal Loss function [39] is often employed for the adjustment of positive and negative samples. It uses two adjustment factors, α and γ, to balance the distribution of positive and negative samples as well as that between difficult and easy samples, in this way aiming to rectify the imbalance of data samples. Still, the Focal Loss function is not without its shortcomings. Firstly, it disproportionately focuses on positive samples that are hard to classify while neglecting the training of negative samples, which can lead to reductions in model performance. Secondly, the interaction between the adjustment factors α and γ unavoidably varies across different datasets, so their optimal values must be determined via extensive experimentation. The GHM loss function offers an innovative solution, by assessing the number of samples within specific ranges using a gradient density variable. More precisely, it mitigates loss by considering the volume of samples within a particular confidence interval p, overcoming the drawbacks besetting the Focal Loss function. The methodology for calculating the GHM loss function is given in Equations (18)–(20):
g = |p - p^{*}| = \begin{cases} 1 - p, & p^{*} = 1 \\ p, & p^{*} = 0 \end{cases} \quad (18)
where g is the length of the gradient mode, p is the prediction probability, and p* denotes the actual label.
GD(g) = \frac{1}{l_{\varepsilon}(g)} \sum_{k=1}^{N} \delta_{\varepsilon}(g_{k}, g) \quad (19)
Here, GD(g) is the gradient density; the term δε(gk, g) corresponds to the count of samples whose gradient modulus falls within the range (g − ε/2, g + ε/2) among samples 1 to N, while lε(g) denotes the length of the interval (g – ε/2, g + ε/2). The loss for each individual sample is then calculated using Equation (20):
L_{GHM} = \sum_{i=1}^{N} \frac{L_{CE}(p_{i}, p_{i}^{*})}{GD(g_{i})} \quad (20)
where N is the total number of samples, L_CE(p_i, p_i^*) denotes the cross-entropy loss of the i-th sample, and GD(g_i) signifies the gradient density of the i-th sample.
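A minimal, binned approximation of the GHM-weighted binary cross-entropy of Equations (18)–(20); the number of bins and the exact weighting scheme are assumptions of this sketch rather than the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def ghm_binary_loss(pred_prob, target, bins=10):
    """GHM-weighted binary cross-entropy (sketch of Eqs. (18)-(20)).
    pred_prob, target: flattened tensors of predicted probabilities and 0/1 labels."""
    g = (pred_prob - target).abs().detach()          # Eq. (18): gradient norm per sample
    n = g.numel()
    weights = torch.zeros_like(g)
    edges = torch.linspace(0, 1, bins + 1)
    for i in range(bins):                            # Eq. (19): count samples per bin
        lo, hi = edges[i], edges[i + 1]
        in_bin = (g >= lo) & (g < hi) if i < bins - 1 else (g >= lo) & (g <= hi)
        count = in_bin.sum().item()
        if count > 0:
            # Down-weight samples in dense gradient regions (mostly easy negatives).
            weights[in_bin] = n / (count * bins)
    ce = F.binary_cross_entropy(pred_prob, target.float(), reduction="none")
    return (ce * weights).sum() / n                  # Eq. (20)
```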

5. Experiments and Analysis

5.1. Experimental Dataset

In this paper, we mainly study the aquatic disease entity–relationship extraction model and apply it to knowledge graph construction, so the main dataset used is AquaticDiseaseRE. However, AquaticDiseaseRE is a Chinese dataset, so to further validate the model’s performance in the English domain, we additionally use the public English dataset WebNLG for comparative experiments. The WebNLG training, validation, and test sets contain 5019, 500, and 703 sentences, respectively, and the dataset features a certain amount of triplet overlap, similar to the self-built AquaticDiseaseRE dataset.

5.2. Experimental Environment

The experiments described here were conducted on a system running Windows 11. It was equipped with an AMD Ryzen 7 5800H processor with Radeon Graphics at 3.20 GHz, along with 16 GB of RAM and an NVIDIA GeForce RTX 3060 graphics card. The software environment consisted of Python v3.9.13 and the PyTorch framework v1.10.1.

5.3. Model Parameters

We use different parameters to conduct experiments for different datasets. The hyperparameters for the experimental model of this study are in Table 5.

5.4. Evaluation Metrics

For the task of relation extraction, precision (P), recall (R), and the F1 score were the chief evaluation metrics, as given by Equations (21)–(23).
P = \frac{TP}{TP + FP} \quad (21)
R = \frac{TP}{TP + FN} \quad (22)
F1 = \frac{2 \times P \times R}{P + R} \quad (23)
where TP is the number of positive samples correctly predicted, FP is the number of negative samples incorrectly predicted as positive, FN is the number of positive samples incorrectly predicted as negative, and TN is the number of negative samples correctly predicted as negative.
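For triplet extraction, these metrics are typically computed by exact matching of predicted (HEAD, RELATIONSHIP, TAIL) triplets against the gold set; the helper below is a sketch of that convention, not the evaluation script used in this study.

```python
def triplet_prf(predicted, gold):
    """Micro precision/recall/F1 over exact-match (head, relation, tail) triplets.
    predicted, gold: lists of sets of triplets, one set per sentence."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))   # correctly extracted triplets
    fp = sum(len(p - g) for p, g in zip(predicted, gold))   # spurious extractions
    fn = sum(len(g - p) for p, g in zip(predicted, gold))   # missed gold triplets
    precision = tp / (tp + fp) if tp + fp else 0.0                               # Eq. (21)
    recall = tp / (tp + fn) if tp + fn else 0.0                                  # Eq. (22)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0  # Eq. (23)
    return precision, recall, f1

# Example (hypothetical triplet):
# p, r, f1 = triplet_prf([{("IPN disease", "treated_with", "povidone-iodine")}],
#                        [{("IPN disease", "treated_with", "povidone-iodine")}])
```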

5.5. Pre-Trained Model Replacement Experiment

To investigate the impact of various Chinese pretrained language model versions on the performance of an aquatic disease relationship extraction model when utilized as the encoding layer and, thereby, select the most suitable pretrained model for in-task fine-tuning, we tested four distinct pretrained models. Specifically, Bert, Bert-wwm, Bert-wwm-ext, and Roberta-wwm-ext were chosen for this coding layer-substitution experiment whose results are presented in Table 6.
Evidently, the Roberta-wwm-ext model achieves the highest precision rate (78.92%), recall rate (74.56%), and F1 score (76.68%). Compared to the Bert, Bert-wwm, and Bert-wwm-ext models, the F1 score of Roberta-wwm-ext showed an improvement of 5.78%, 4.46%, and 3.34%, respectively. This superior performance of the Roberta-wwm-ext pretrained model can be attributed to its use of a larger training dataset, a larger batch size, and a whole word-masking mechanism that is better suited to the Chinese language. Collectively, these factors contributed to its advantage over the other pretrained models. Therefore, based on the experimental results, this paper selected Roberta-wwm-ext as the foundational model for fine-tuning the encoding layer.

5.6. Impact of Loss Functions on Unbalanced Data

In this study, to address the issue of data imbalance, we implemented the GHM loss function and compared its effectiveness against the Focal Loss and cross-entropy (CE) loss functions as applied to the AquaticDiseaseRE dataset and WebNLG dataset. The Focal Loss was parameterized with α = 0.5 and γ = 0.25. Figure 8 shows the outcomes of these comparisons.
Evidently, the GHM loss value exceeds that of both the CE and Focal Loss. This pronounced difference is due to GHM’s strategy of weighting the gradient of each sample, which ensures a more balanced contribution to the overall loss from gradients of samples varying in difficulty levels. The aquatic disease dataset in this study contains many difficult samples due to its wide-ranging categories and uneven distribution of these samples. Consequently, difficult-to-classify samples constitute a larger fraction of the dataset, increasing the overall loss for the model. Incorporating GHM loss results in a notable enhancement of the model’s overall F1 score vis-à-vis both CE and Focal Loss. Specifically, for the test set, the model using the GHM loss achieves the best performance, showing an improvement of 2.77% and 4.35% over CE and Focal Loss, respectively. This suggests the GHM loss function is able to effectively prioritize the difficult-to-classify samples and reduces the weights for easier ones so as to achieve a better balance of loss, thereby enhancing the model’s predictive accuracy for less represented categories. In contrast, the overall performance of Focal Loss diminishes relative to the CE loss function, likely because of Focal Loss’s excessive focus on hard-to-categorize samples. This overemphasis may cause the model to overfit those outliers.
The WebNLG dataset follows the same pattern as the AquaticDiseaseRE dataset because the two share similar characteristics: both contain many relationship types and a large number of overlapping relationship triples. In addition, the model’s loss values fluctuate more on the WebNLG dataset than on AquaticDiseaseRE because WebNLG is relatively more complex, with 171 predefined relationship categories, a larger proportion of overlapping triples, and more imbalanced data. In general, GHM works well for datasets with limited and unbalanced samples.

5.7. Comparative Experiment

To verify the effectiveness of the model proposed in this paper, we conducted comparative experiments with several other leading entity relationship extraction models on the AquaticDiseaseRE dataset and the WebNLG dataset:
(1)
NovelTagging [14]: A model designed for the joint extraction of entity relations by leveraging innovative annotation strategies;
(2)
CopyR [16]: An end-to-end entity relationship extraction model that utilizes a replication mechanism;
(3)
GraphyRel [22]: A relationship extraction model whose operation is based upon the structure of relationship graphs;
(4)
TPLinker [18]: A labeling framework that utilizes the spans of entity heads and tails;
(5)
CasRel [15]: A pointer network model that employs cascading binary tags;
(6)
PFN [19]: An entity–relationship extraction model that uses partitioned filter networks;
(7)
UniRel [20]: An entity–relationship extraction model that integrates relational semantics.
The experimental findings are presented in Figure 9 and Table 7.
According to Table 7, the model introduced in this paper exhibits impressive performance when applied to the aquatic disease relationship extraction dataset. Specifically, its precision (P), recall (R), and F1 score metrics are distinguished by an enhancement of 5.84%, 14.69%, and 8.66%, respectively, in comparison with the next best-performing model, PFN. Furthermore, the results for the test set were consistent with the F1 value curve for the validation set. The challenge of handling many overlapping relationship triplets in this dataset renders NovelTagging less effective; it cannot properly address the issue of overlapping relationship extraction because it only considers scenarios where an entity is part of a single triplet. Meanwhile, the CopyR and GraphyRel series models, which utilize GRU or BiLSTM for encoding, fall short in effectiveness when compared to the Bert pretraining model’s encoding. Although the TPLinker, Casrel, PFN, and UniRel models show marked improvements, the pretraining models they employ are not sufficiently fine-tuned for specialized domains, resulting in a constrained enhancement of their effects relative to the model we propose in this paper. This outcome further emphasizes that, for highly specific domains, employing fine-tuned pretraining models is essential for augmenting the semantic encoding effect.
In addition, the model proposed in this paper also achieved the best performance on the English public dataset WebNLG, with P, R, and F1 values of 96.79%, 95.46%, and 96.12%, respectively, which are 1.91%, 0.83%, and 1.37% higher than those of the best-performing comparison model, UniRel. Since WebNLG is a general-domain public dataset with little specialized terminology, fine-tuning the pretrained model on such data yields only a modest improvement. The experimental results fully demonstrate that the model has potential cross-language application capabilities.

5.8. Experiment on Relationship Overlap

Besides improving the relationship extraction, another critical issue that our model addresses is the problem of relationship overlap. To assess the model’s extraction capabilities under conditions of overlapping relationships, we categorized the test set data into two classes, Normal and Single Entity Overlap (SEO), based on the presence of overlapping relationships. The Normal category has 830 data points, while the SEO category has 196 data points (see Table 3). Figure 10 shows the experimental outcomes.
Evidently, the F1 scores of the model proposed in this study are superior to those of the other comparison models for both the Normal and SEO categories tested. The performance of those latter models for these two relationship extraction types displays a general reduction, which suggests the task of extracting entity–relationship triplets gets harder with a greater proportion of shared entities in the dataset. Specifically, the NovelTagging, CopyR, and GraphRel models undergo a marked decline in F1 scores when dealing with data that includes relationship overlap, indicating their limitations in handling such overlapping forms of data. By contrast, the remaining models actually exhibit slightly improved performance for overlapping relationship data, which we attribute to their respective architectures’ capability to address that key issue.
The Fd-CasBGRel model developed in this study has the best results for both Normal- and SEO-type relationship extraction tasks. It even records higher F1 scores in SEO- than Normal-type data, which could be due to the smaller sample size of SEO-type in the test set. Overall, these experimental outcomes provide compelling evidence that the model presented in this paper is more adept at resolving the challenge of relationship overlap in the task of aquatic disease relationship extraction.
To further validate the model’s extraction performance across scenarios with varying numbers of triplets, we categorized the data into five classes based on their quantity. These results are presented in Table 8.
A discernible trend emerges where the F1 scores of both the NovelTagging and CopyR models diminish to varying extents as the number of triplets per sentence increases. This trend suggests a greater difficulty in processing semantic information and navigating a more complex answer space when more triplets are present, leading to potential confusion and errors during extraction. Additionally, the occurrence of overlapping relation triplets further complicates the extraction process. Conversely, models such as TpLinker, Casrel, PFN, and Unirel have significantly improved F1 scores when the triplet count is between 2 and 4. This better performance is ascribed to these models’ capacity to manage overlapping relational triplets, showcasing their exceptional performance within a specific triplet range. However, as the sentence’s triplet count continues to rise, the F1 scores of those models begin to decline to varying degrees.
The model we developed here achieves the highest F1 score across all five different aquatic disease datasets with varying triplet counts, with a peak F1 score of 90.58% at N = 4. This score not only surpasses several comparative models but also exceeds the best-performing PFN model by 6.44%. This superior performance is credited to the cascading labeling framework employed in this study, which adeptly handles overlapping relationship data across multiple triplets. The integration of a fine-tuned pretraining model and a feature fusion module further enriches the sentence semantic expression and enhances extraction effectiveness. In particular, the incorporated GHM loss function balances well the sample weights and ameliorates the imbalanced data in scenarios involving multiple triplets.
In summary, when assessed alongside several baseline comparative models, this study’s proposed model more effectively addresses the complexities of relationship overlap and multiple triplets in the aquatic disease domain.

5.9. Ablation Experiment

To evaluate the impact of the FD–Roberta-wwm-ext pretrained model, the BRC feature fusion module, and the GHM loss function on aquatic disease entity–relationship extraction, module ablation experiments were also carried out. The training outcomes are conveyed in Figure 11. Here, Casrel denotes the foundational framework, Casrel_BRC refers to the incorporation of only the feature fusion module, Casrel_GHM corresponds to the integration of solely the GHM loss function within the base framework, Casrel_FDRobertawwmext represents the sole addition of the fine-tuned pretraining model, and Fd-CasBGRel is the model proposed in this paper.
We can see from Figure 11 that the Casrel model, when augmented with the fine-tuned pretraining model, demonstrates significant improvements over the baseline Casrel model in terms of precision (P), recall (R), and F1 score metrics, as well as in convergence speed. This enhancement can be attributed to the fine-tuned model’s ability to acquire domain-specific knowledge about aquatic diseases and more effectively encode the semantic features of those entities in comparison to the original pretraining model. Including the BRC feature fusion module alone leads to a notable increase in the model’s precision relative to the baseline model, suggesting that feature fusion enriches the semantic vector representations, thereby augmenting the accuracy of entity recognition. The recall undergoes slight improvement, while the overall F1 score surpasses that of the baseline model. Not surprisingly, incorporating the GHM loss function also contributes to performance gains to a certain extent by mitigating the data imbalance. Ablation experiments were also conducted on the test set, with these results presented in Table 9.
Incorporating the feature fusion module, GHM loss function, and fine-tuned pretraining model individually into the model yields F1 scores of 75.86%, 73.67%, and 82.95%, respectively. This corresponds to an enhancement of 4.96%, 2.77%, and 12.05% vis-à-vis the baseline model, indicating that each of the three augmentations is beneficial to the model’s performance. Notably, a greater improvement is gained from fine-tuning the pretraining model than from both the feature fusion module and GHM loss function. This can be ascribed to the pretraining model’s better ability to learn disease text data features through fine-tuning, which provides stronger domain adaptation and more effective encoding of semantic vectors. The adoption of the feature fusion module, with its ensuing 4.96% improvement, demonstrates that fusing the relative position features of the head entity’s first and last characters, local sentence features, and contextual features through CLN enables deeper feature integration and enriches semantic representations beyond the mere addition of head entity and sentence features. This confirms the efficacy of the BRC module. Moreover, integrating the GHM loss function results in a 2.77% improvement over the baseline model; this suggests the GHM loss function rectifies, to some extent, the issue of imbalanced aquatic disease data categories. When these three enhancements are applied in tandem, the model attains its peak F1 score of 84.71% for the aquatic disease dataset, indicating a synergistic effect that collectively boosts the model’s extraction capabilities.
In this paper, the inputs to the proposed feature fusion module are processed through the self-attention mechanism, a BiLSTM network, and a multi-head attention mechanism that incorporates entity position computation. To further assess the impact of these components, we conducted another ablation experiment as follows: BRC represents the model with the complete feature fusion module; -SA refers to the model with the self-attention mechanism removed; -BiLSTM denotes the removal of the BiLSTM network; -RPE corresponds to the exclusion of the relative position encoding, thus relying solely on the head and tail word vectors of the head entity as conditional inputs for the CLN.
As evinced by Table 10, the removal of various components from the BRC model leads to a reduction in F1 scores to varying degrees. Among them, removing the self-attention mechanism encoding results in a 1.66% decrease in the model’s F1 score. This suggests that the self-attention mechanism enhances the interaction between the BERT encoding layer and the BiLSTM layer, effectively modeling the distant dependencies within the sentence while emphasizing its local key features. This, in turn, strengthens the contextual representation of the BiLSTM encoding. Moreover, eliminating the BiLSTM leads to a significant drop in the F1 score by 3.82% compared to the baseline. This highlights the importance of contextual features for the task of relationship extraction, which necessitates parsing the sentence structure. Additionally, the F1 score decreases by 1.35% after removing the relative coding of entity positions, indicating that incorporating the positional attributes of the head entity into the model can somewhat improve its ability to localize the head entity. This enhancement aids in better extraction of relationships and tail entities, thereby boosting the overall performance of the model. Overall, the experimental results demonstrate that incorporating additional features such as local key features, contextual sentence features, and relative positional attributes of head entities can effectively enhance the model’s performance.

5.10. Construction of the Knowledge Graph

Next, we can use the Fd-CasBGRel model to extract ternary data related to aquatic diseases and store that in a CSV file. By then importing the ternary data into the Neo4j graph database using Neo4j’s import command, we can begin building a knowledge graph. Statistical analysis reveals that this preliminary construction of the aquatic disease knowledge graph has a total of 20,469 ternary data items. It covers 33 types of entity categories and 32 types of relationship categories. Figure 12 shows a portion of the visualized knowledge graph for the aquatic disease domain.
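As an illustration of this import step, the sketch below loads the extracted triplets from a CSV file into Neo4j via the official Python driver; the file name, column layout, node label, and connection credentials are assumptions.

```python
# Sketch: importing extracted (head, relation, tail) triplets into Neo4j.
# CSV layout, label names, and credentials are illustrative assumptions.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Relation name stored as a property, since Cypher relationship types cannot be parameterized.
query = """
MERGE (h:Entity {name: $head})
MERGE (t:Entity {name: $tail})
MERGE (h)-[r:RELATION {type: $relation}]->(t)
"""

with driver.session() as session, open("aquatic_disease_triples.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):          # columns assumed: head, relation, tail
        session.run(query, head=row["head"], relation=row["relation"], tail=row["tail"])

driver.close()
```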

6. Conclusions

This paper addresses the challenges of relationship overlap, corpus specialization, low feature fusion, and data imbalance in the task of extracting entity relationships for aquatic diseases, all of which diminish the efficacy of relationship extraction. To deal with these issues, we adopt a cascading binary labeling framework as the overarching architecture to enhance the extraction of overlapping relationship triplets and overcome the issue of overlapping relationships in aquatic diseases. Further, the pretrained model in the encoding layer is fine-tuned with a corpus related to aquatic diseases, augmenting the model’s capacity for the semantic encoding of aquatic disease texts. Building on this foundation, we introduce the BRC feature fusion module, which deeply integrates the relative position features of the head entity’s first and last characters, local sentence features, and contextual sentence features via a self-attention mechanism, a BiLSTM network, and a conditional layer normalization process. This module significantly strengthens the model’s feature fusion capabilities in the extraction task, enriches semantic representations, and, thus, amplifies the extraction effect. Additionally, the GHM loss function is implemented to help rectify the data imbalance issue. Finally, the refined model is used to successfully initiate the preliminary construction of a knowledge graph for aquatic diseases, evincing the model’s practical applicability and effectiveness in improving the extraction of aquatic disease entity relationships. The experimental findings demonstrate that the model introduced in this study achieves the highest F1 score of 84.71% on the dataset for aquatic disease entity–relationship extraction. Notably, it has an F1 score of 86.52% within the category of data involving overlapping entities, greatly outperforming several established mainstream models for entity–relationship extraction. This highlights the model’s comprehensive effectiveness in addressing the aforementioned challenges. In addition to the self-built aquatic disease dataset, comparative experiments were also conducted on the public English dataset WebNLG; the model again performed well, demonstrating its applicability across languages.
While positional features play a pivotal role, this study acknowledges that other lexical and syntactic features remain underexploited; investigating how to effectively mine and integrate them to further boost performance is a chief focus of our future research. In addition, the application of knowledge graphs within the agricultural sector is still relatively unexplored, with most attempts centered on question-answering scenarios. Combining knowledge graphs with real-time monitoring of environmental parameters and fish growth to drive early prediction and warning systems for aquatic diseases represents a promising research direction that warrants further investigation.

Author Contributions

Conceptualization, H.Y.; methodology, L.L.; validation, L.L. and C.Z.; formal analysis, H.Y. and D.S.; investigation, L.L.; resources, C.Z.; data curation, D.S.; writing—original draft preparation, L.L.; writing—review and editing, C.Z. and D.S.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Program of Zhejiang (2023C02029) and the Agricultural Technology Cooperation Program in Zhejiang Province of China (2022SNJF068).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Types of Named Entities

We categorized the entities in the dataset into 33 types based on the aquatic disease knowledge system. The specific types and their meanings are detailed in Table A1.
Table A1. Types of entities.
ID | Name | Description
1 | Pathogens | Pathogenic micro-organisms responsible for causing diseases
2 | diseases | Disease Information for Chinese Audiences
3 | pathogen name | Scientific Name of the Pathogen
4 | biological classification | Biological classification of pathogens
5 | characteristics of the pathogen | External characteristics of pathogenic microorganisms
6 | colony characteristics | External characteristics of the colony, including color, shape, and other features
7 | structure of the pathogen | Structural composition of the pathogen
8 | types of spores | Types of Spores Found in Fungi
9 | substances sensitive to the pathogen | Substances detrimental to pathogens
10 | identification experiments | Experiments for Identifying Pathogenic Microorganisms
11 | gram staining | Gram Staining Characteristics of Bacteria
12 | culture medium | Types of Media Used for Culturing Bacteria and Fungi
13 | proliferating cell lines | Type of Cell Line Used for Culturing the Virus
14 | stages of life history | Stages of Growth and Development in Pathogenic Micro-organisms
15 | optimal growth water temperature | Optimal Water Temperature for the Growth and Development of Pathogenic Micro-organisms
16 | average growth water temperature | Average Water Temperature Required for the Growth and Development of Pathogenic Micro-organisms
17 | disease name | Scientific Name of the Disease
18 | disease aliases | Disease Information for Aliases
19 | affected locations | Symptoms and Sites of Pathological Manifestations of the Disease
20 | symptoms | Macroscopic Manifestations of the Disease
21 | pathological manifestations | Information on Micropathological Manifestations of the Disease
22 | causative factors | Factors Predisposing to Disease
23 | modes of transmission | Modes of Disease Transmission
24 | susceptibility stages | Disease Information for Stages of Growth and Development in Aquatic Animals Susceptible to Diseases
25 | times of prevalence | Seasonal Prevalence of the Disease
26 | epidemic regions | Countries or Areas Where the Disease is Endemic
27 | optimal water temperatures for disease onset | Optimal Water Temperature Required for Disease Prevalence
28 | average water temperatures for disease onset | Average Water Temperature Required for Disease Prevalence
29 | species at risk | Aquatic Animals Susceptible to a Specific Disease
30 | diagnostic methods | Methods of Diagnosing Disease
31 | preventive measures | Preventive Measures for Specific Diseases
32 | treatment methods | Treatment of Specific Diseases
33 | preventive and curative medications | Name of the Medication Used for Prevention or Treatment of a Disease

Appendix A.2. Types of Relations

Based on the predefined entity categories, the relationship categories of the aquatic disease knowledge graph were established, consisting of 32 types. The specific categories and their meanings are detailed in Table A2.
Table A2. Types of relations.
ID | Name | Description
1 | Production | Production of Macroscopic Symptoms in Disease
2 | Hazard | Which specific species are most susceptible to the disease
3 | Lesion Site | What are the common sites for the appearance of disease symptoms or pathology
4 | Characterization | What physical characteristics define this pathogen
5 | Structure | What are the names of the structural features of the pathogen
6 | Prevention | What specific preventive measures are recommended for this disease
7 | Formation | What are the micropathological manifestations associated with disease development
8 | Treatment | What specific treatments are available for this disease
9 | Region | In which countries or regions is the disease endemic
10 | Control | What are the names of the drugs used for controlling the disease
11 | Susceptibility | At which developmental stage are aquatic animals most susceptible to disease outbreaks
12 | Diagnosis | What specific diagnostic measures are used to identify this disease
13 | Cause | What epidemics can a particular pathogen potentially cause
14 | Affiliation | To which biological categories do pathogens belong
15 | Time | At what point in time do diseases typically manifest
16 | Predisposing factors | Factors to Disease Epidemics
17 | Pathogen scientific name | What is the scientific name of the pathogen
18 | Life history stage | What are the main stages in the life cycle of pathogens
19 | Colony characterization | What are the external characteristics of bacterial and fungal colonies
20 | Dissemination | What are the modes of transmission for the disease
21 | Possession | What types of spores do fungi produce
22 | Disease scientific name | What is the scientific name of the disease
23 | Disease aliases | What is an alternative name for the disease
24 | Cultivation | What are the media used to culture pathogenic microorganisms
25 | Identification | What methods are used to identify the nature of the pathogen
26 | Optimal growth water temperature | What is the optimal water temperature for pathogen growth
27 | Proliferation | What cell lines are commonly used to culture pathogenic microorganisms
28 | Average water temperature for disease onset | What is the typical water temperature range associated with the prevalence of the disease
29 | Staining | What are the Gram staining characteristics of bacteria
30 | Aversion | To which substances is the pathogen more sensitive
31 | Average growth water temperature for disease onset | What is the typical water temperature range conducive to pathogen growth
32 | Optimal water temperature for disease onset | What is the optimal water temperature for disease prevalence

References

1. Construction of marine ranching to enrich the 'blue granary'. Economic Daily, 5 July 2023; p. 011.
2. Feng, J.W. The Ministry of Agriculture and Rural Affairs held the '14th Five-Year' Fishery High Quality Development Promotion Meeting. Farmers' Daily, 25 August 2022; p. 001.
3. Zhu, X.M.; Zhao, M.J.; Wang, Y.G.; Sun, H.W. Comparative study on edible rate and protein contribution of aquatic products. Chi. Fish Qua Stand. 2021, 11, 32–39.
4. Fensel, D.; Şimşek, U.; Angele, K.; Huaman, E.; Kärle, E.; Panasiuk, O.; Toma, I.; Umbrich, J.; Wahler, A. Introduction: What is a knowledge graph? In Knowledge Graphs: Methodology, Tools and Selected Use Cases; Springer: Cham, Switzerland, 2020; pp. 1–10.
5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
6. Li, Q.; Ji, H. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA, 23–25 June 2014; pp. 402–412.
7. Zang, Y.S.; Liu, S.K.; Liu, Y. Joint extraction of entities and relations based on deep learning: A survey. Acta Electronica Sinica 2023, 51, 1093–1116.
8. Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv 2016, arXiv:1601.00770.
9. Zeng, X.; Zeng, D.; He, S.; Liu, K.; Zhao, J. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 506–514.
10. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 2018, 114, 34–45.
11. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2830–2836.
12. Yu, B.; Zhang, Z.; Shu, X.; Wang, Y.; Liu, T.; Wang, B.; Li, S. Joint extraction of entities and relations based on a novel decomposition strategy. arXiv 2019, arXiv:1909.04273.
13. Zhuang, C.; Zhang, N.; Jin, X.; Li, Z.; Deng, S.; Chen, H. Joint extraction of triple knowledge based on relation priority. In Proceedings of the 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Exeter, UK, 17–19 December 2020; pp. 562–569.
14. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017.
15. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A novel cascade binary tagging framework for relational triple extraction. arXiv 2019, arXiv:1909.03227.
16. Zheng, H.; Wen, R.; Chen, X.; Yang, Y.; Zhang, Y.; Zhang, Z.; Zhang, N.; Qin, B.; Ming, X.; Zheng, Y. PRGC: Potential relation and global correspondence based joint relational triple extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual, 1–6 August 2021; pp. 6225–6235.
17. Bai, C.; Pan, L.; Luo, S.; Wu, Z. Joint extraction of entities and relations by a novel end-to-end model with a double-pointer module. Neurocomputing 2020, 377, 325–333.
18. Wang, Y.; Yu, B.; Zhang, Y.; Liu, T.; Zhu, H.; Sun, L. TPLinker: Single-stage joint extraction of entities and relations through token pair linking. arXiv 2020, arXiv:2010.13415.
19. Yan, Z.; Zhang, C.; Fu, J.; Zhang, Q.; Wei, Z. A partition filter network for joint entity and relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 185–197.
20. Tang, W.; Xu, B.; Zhao, Y.; Mao, Z.; Liu, Y.; Liao, Y.; Xie, H. UniRel: Unified representation and interaction for joint relational triple extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7087–7099.
21. Wang, S.; Zhang, Y.; Che, W.; Liu, T. Joint extraction of entities and relations based on a novel graph scheme. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 13–19 July 2018; pp. 4461–4467.
22. Fu, T.J.; Li, P.H.; Ma, W.Y. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1409–1418.
23. Duan, G.; Miao, J.; Huang, T.; Luo, W.; Hu, D. A relational adaptive neural model for joint entity and relation extraction. Front. Neurorobotics 2021, 15, 635492.
24. Cabot, P.-L.H.; Navigli, R. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2370–2381.
25. Ye, H.; Zhang, N.; Deng, S.; Chen, M.; Tan, C.; Huang, F.; Chen, H. Contrastive triple extraction with generative transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2–9 February 2021; pp. 14257–14265.
26. Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified structure generation for universal information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5755–5772.
27. Sui, D.; Zeng, X.; Chen, Y.; Liu, K.; Zhao, J. Joint entity and relation extraction with set prediction networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, early access.
28. Yang, H. Entity Recognition and Relation Extraction for the Construction of Fishery Standard Knowledge Graph. Master's Thesis, Dalian Ocean University, Dalian, China, 2022.
29. Liu, J.S.; Yang, H.N.; Sun, Z.T.; Yang, H.; Shao, L.M.; Yu, H.; Zhang, S.J.; Ye, S.G. Named-entity recognition for the diagnosis and treatment of aquatic animal diseases using knowledge graph construction. Trans. CSAE 2022, 38, 210–217.
30. Jiang, X. Construction of Knowledge Graph to Diagnose for Aquatic Animals Disease. Master's Thesis, Dalian Ocean University, Dalian, China, 2022.
31. Bi, T.T.; Zhang, S.J.; Sun, X.F.; Wang, S.T.; Wang, W.H.; An, Z.S. A joint extraction method of entity relationships in aquaculture long text using N-Gram fusion. J. Harbin Univ. Sci. Technol. 2024, 46, 1–13.
32. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
33. Li, B.; Liu, Y.; Wang, X. Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8577–8584.
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Long Beach, CA, USA, 2017; Volume 30.
35. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155.
36. Lee, D.; Tian, Z.; Xue, L.; Zhang, N.L. Enhancing content preservation in text style transfer using reverse attention and conditional layer normalization. arXiv 2021, arXiv:2108.00449.
37. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don't stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964.
38. Zhang, L.; Lu, L.; Wang, A.J.; Yang, W. PATB: An information booster for joint entity and relationship extraction. J. Chin. Comput. Syst. 2023, 44, 2338–2345.
39. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Figure 1. Hierarchy of aquatic disease data used in this study.
Figure 2. Aquatic disease data relationship statistics.
Figure 3. Histogram of sentence lengths across the dataset.
Figure 4. Examples of relationship overlap.
Figure 5. Structure of the constructed Fd-CasBGRel model.
Figure 6. Example inputs for the Bert model.
Figure 7. The CLN module structure.
Figure 8. Impacts of three different loss functions on rectifying the imbalanced dataset.
Figure 9. Comparison of the training process for 10 experimental models on AquaticDiseaseRE.
Figure 10. Experimental results of 10 models for handling overlapping relationships.
Figure 11. Ablation of the training process for experimental models based on Casrel.
Figure 12. Knowledge graph (selected portion) of the aquatic disease domain.
Table 1. Data sources for the corpus of text on aquatic diseases.
ID | Book or Website | Author | Publisher | Year of Publication or Launch
1 | Aquatic Animal Disease Diagnosis and Control Technology | Xiaoling Liu | Chemical Industry Press | 2009
2 | Aquatic Animal Pathology | Wenbin Zhan | China Agriculture Press | 2011
3 | Aquatic Animal Disease Control Technology | Denglai Li | China Agriculture Press | 2019
4 | Aquaculture Network | China Society of Fisheries | -- | 2006
5 | Fisheries Expertise Service System | Chinese Academy of Fishery Sciences | -- | 2018
Table 2. Examples of AquaticDiseaseRE data samples.
Data * | Subject | Relation | Object
Treatment of white spot disease: Administer a 1 mg/L chlorine bleach solution throughout the entire pond. | white spot disease | Treatment | Administer a 1 mg/L chlorine bleach solution throughout the entire pond.
The main target of herpesvirus disease of turbot is the large turbot. | herpesvirus disease of turbot | Hazard | turbot
Lamellodiscusiasis occurs frequently in seabreams reared in Qingdao and Guangdong. | Lamellodiscusiasis | Region | Qingdao; Guangdong
Flavobacterium columnare belongs to the family myxobacteriaceae. | Flavobacterium columnare | Affiliation | myxobacteriaceae
Aeromonas hydrophila has a short rod-shaped body without spores. | Aeromonas hydrophila | Characterization | short rod-shaped body without spores
* The original data examples are in Chinese and have been translated into English to better serve our readers.
Table 3. AquaticDiseaseRE dataset composition.
Type | Train | Dev | Test
Normal | 6708 | 835 | 830
SEO | 1500 | 191 | 196
Table 4. Counting the number of distinctive triplets in a sentence.
Triplet Number | Train | Dev | Test | All
N = 1 | 3825 | 454 | 448 | 4727
N = 2 | 2081 | 286 | 268 | 2635
N = 3 | 1363 | 175 | 171 | 1709
N = 4 | 632 | 66 | 91 | 789
N ≥ 5 | 307 | 45 | 48 | 400
Table 5. Model training parameters used.
Parameter | AquaticDiseaseRE | WebNLG
Length of hidden vectors | 768 | 768
Maximum input sentence length | 294 | 294
Batch size | 12 | 6
Dropout | 0.5 | 0.5
Number of iterations | 60 | 100
Learning rate | 1 × 10⁻⁵ | 1 × 10⁻⁵
Entity tagging threshold | 0.5 | 0.5
Optimizer | AdamW | Adam
Maximum distance for relative position encoding | 50 | 50
Number of long attention spans | 8 | 8
BiLSTM network layers | 3 | 3
Table 6. Evaluation results of different pretrained models.
Pretrained Model | P | R | F1
Bert | 73.17 | 68.77 | 70.90
Bert-wwm | 74.43 | 70.14 | 72.22
Bert-wwm-ext | 75.26 | 71.52 | 73.34
Roberta-wwm-ext | 78.92 | 74.56 | 76.68
Table 7. Evaluation results for the 10 experimental models.
Model | AquaticDiseaseRE P | AquaticDiseaseRE R | AquaticDiseaseRE F1 | WebNLG P | WebNLG R | WebNLG F1
NovelTagging | 48.66 | 38.95 | 43.27 | 52.58 | 19.36 | 28.30
CopyR-Multi | 50.07 | 33.30 | 40.00 | 37.76 | 36.48 | 37.11
CopyR-One | 48.95 | 31.51 | 38.34 | 32.24 | 28.91 | 30.48
GraphRel-1p | 61.01 | 43.13 | 50.54 | 42.36 | 39.24 | 40.74
GraphRel-2p | 62.36 | 44.23 | 51.75 | 44.73 | 41.15 | 42.87
TPLinker | 74.29 | 71.64 | 72.94 | 89.14 | 84.79 | 86.91
Casrel | 73.17 | 68.77 | 70.90 | 94.17 | 91.36 | 92.74
PFN | 77.31 | 74.84 | 76.05 | 94.72 | 92.37 | 93.53
UniRel | 76.32 | 73.68 | 74.98 | 94.88 | 94.63 | 94.75
Fd-CasBGRel | 83.15 | 86.33 | 84.71 | - | - | -
Wn-CasBGRel * | - | - | - | 96.79 | 95.46 | 96.12
* Wn-CasBGRel denotes the English-language version of our model, in which the pretrained model in the encoding layer is replaced with its English counterpart and fine-tuned on the English dataset; all other components are identical to Fd-CasBGRel.
Table 8. Comparison of F1 scores for models handling sentences containing different numbers of triplets.
Model | N = 1 | N = 2 | N = 3 | N = 4 | N ≥ 5
NovelTagging | 47.20 | 37.63 | 39.88 | 43.96 | 41.86
CopyR-Multi | 53.39 | 34.38 | 38.02 | 31.74 | 30.00
CopyR-One | 54.78 | 34.46 | 33.70 | 29.74 | 22.39
GraphRel-1p | 56.51 | 43.61 | 48.60 | 58.82 | 45.62
GraphRel-2p | 54.98 | 42.69 | 51.10 | 59.29 | 47.56
TPLinker | 67.88 | 66.94 | 72.81 | 80.04 | 76.47
Casrel | 65.86 | 66.08 | 71.67 | 79.11 | 75.21
PFN | 72.59 | 68.73 | 76.74 | 84.14 | 79.63
UniRel | 71.36 | 67.85 | 75.27 | 82.97 | 78.24
Fd-CasBGRel | 82.21 | 80.78 | 84.04 | 90.58 | 82.94
Table 9. Overall results for the ablation experiment.
Model | P | R | F1
Casrel | 73.17 | 68.77 | 70.90
Casrel_BRC | 76.76 | 74.98 | 75.86
Casrel_GHM | 74.92 | 72.46 | 73.67
Casrel_FDRobertawwmext | 83.35 | 82.56 | 82.95
Fd-CasBGRel | 83.15 | 86.33 | 84.71
Table 10. Results of the BRC module ablation experiment.
Model | P | R | F1
BRC | 76.76 | 74.98 | 75.86
-SA | 74.69 | 73.71 | 74.20
-BiLSTM | 72.74 | 71.35 | 72.04
-RPE | 75.08 | 73.94 | 74.51
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
