1. Introduction
China leads the world in aquaculture, with a total output of 68.65 million tons of aquatic products in 2022 alone [1]. Of this, 55.65 million tons are directly attributable to aquaculture production, an aquaculture-to-capture ratio of roughly 81:19. This substantial contribution from aquaculture not only ensures national food security and the supply of vital agricultural products but also plays a pivotal role in boosting farmers’ income [2,3]. However, factors such as high-density stocking and water pollution have led to a high incidence of disease during aquaculture. Accordingly, the diagnosis and prevention of aquatic diseases have emerged as critical bottlenecks hindering the industry’s rapid and sustainable development. It is imperative to leverage modern information technology to empower traditional aquaculture practices and hasten the transition towards digitalization and intelligence in disease diagnosis and prevention techniques.
Relationship extraction is a fundamental step in constructing a knowledge graph [4]; it aims to identify entity–relationship triples (HEAD, RELATIONSHIP, TAIL) in unstructured text. In the aquatic disease domain, relationship extraction primarily involves extracting entities such as diseases, control methods, drugs, and pathogens from textual data related to aquatic diseases, as well as elucidating the relationships among them. These extracted data are then organized into a knowledge graph, which can be readily used in various downstream applications, including intelligent Q&A systems, early disease warning systems, and knowledge visualization tools. By integrating these technologies into the aquatic disease control process, digitalization can be enhanced, thereby improving overall disease management efforts.
For the entity–relationship extraction task within the aquatic disease domain, the extensive overlap among relationships poses a significant issue [5] that impairs the effectiveness of triple extraction. Further, this domain’s data exhibit notable specialization and imbalance, characterized by the prevalence of many specialized terms and an imbalance in both category sizes and classification difficulty. Consequently, general-purpose pretrained language models struggle to extract semantic features and have poor domain adaptability, thereby adversely affecting extraction performance. In addition, during entity–relationship extraction, many supplementary features beyond the encoded features of sentences and entities remain underutilized, leading to feature loss. To address these challenges, this paper proposes a cascading binary labeling framework based on a fine-tuned pretrained model, a feature fusion module, and a GHM loss function. We show that this framework effectively resolves the aforementioned issues and bolsters model performance on small-scale, unbalanced aquatic disease-relationship extraction tasks.
2. Related Work
Entity–relationship extraction is instrumental in the construction of knowledge graphs. The traditional pipeline method divides the task into separate subtasks, first extracting entity pairs and then categorizing them with relationship labels. The main drawback of this approach is that it ignores the potential correlation between the subtasks, which leads to error propagation [6]. In contrast, joint entity–relationship extraction integrates entity recognition and relationship recognition, effectively solving the error propagation problem. Early joint extraction methods include feature engineering methods [7], tree-structured joint models [8], and joint extraction methods based on sequence annotation [9,10,11]. These methods have drawbacks such as poor portability and complex data labeling and feature construction. Further, the key problem of relationship overlap, particularly prevalent in specialized domains like the aquatic sciences, is hard to solve using these approaches.
In this context, Yu et al. [12], Zhuang et al. [13], and Zeng et al. [14] successfully recognized overlapping relation triplets by augmenting the labeling strategy. Yet these approaches still treat relations as discrete labels of entities, which inevitably limits their performance. Wei et al. [15] devised a pointer network to comprehensively model triplets by learning the mapping function linking relations to entities. To address the CasRel model’s low computational efficiency due to relationship redundancy, Zheng et al. [16] introduced the PRGC model, which breaks the task down into three steps: relationship judgment, entity extraction, and entity alignment, thereby effectively enhancing computational efficiency. Models for extracting overlapping relationships based on copy mechanisms rely on various feature extraction networks to capture the semantic features of target entity fragments after localization and then copy them directly to the decoder for extraction. Representative models of this approach include CopyR [9] and DPointer [17], but they have their own drawbacks, such as complex mechanisms and entity–relationship mismatches. Table-filling methods maintain a table for each relation, with each entry indicating whether the tagged pair exhibits that particular relation, e.g., TPLinker [18], PFN [19], and UniRel [20]. These models are effective at resolving overlapping relationships, but they tend to have high model complexity. In recent years, Graph Convolutional Neural Networks (GCNNs) have gained prominence because their graph structure effectively represents the features of various NLP tasks. GCNNs have been applied to overlapping relationship extraction, whereby syntactic dependency trees are converted into adjacency matrices and fed into the graph neural network for relationship extraction; noteworthy examples include the research of Wang et al. [21], Fu et al. [22], and Duan et al. [23]. Relation extraction models based on pretrained language models mainly utilize generative pretrained language models to extract features and, through fine-tuning, can achieve adequate performance on downstream tasks. Representative models include REBEL [24], CGT [25], UIE [26], and SPN [27]. Nevertheless, such pretrained models have non-trivial shortcomings, such as difficulty in handling long texts and complex sentence structures, along with high computational and data requirements.
In the field of fisheries, Yang [28] proposed a BERT-BiLSTM-CRF model that incorporates dual attention mechanisms for words and sentences. This model was designed to enhance the effectiveness of relationship extraction by handling key issues within the fisheries domain, such as semantic loss in lengthy sequences and the irrational weight allocation of vectors. Additionally, Liu et al. [29] developed the CaBiLSTM model, employing a layered approach to improve the recognition accuracy of outer-layer entities. This approach reduces the dimensionality of inner-entity features to mitigate nested-entity obstacles in aquatic disease named entity recognition (NER) and integrates BERT for enhanced performance. Jiang [30] introduced a BIO-based joint entity–relationship annotation strategy, which, in combination with the BERT-BiLSTM-CRF model, strengthens relationship extraction performance. Bi et al. [31] utilized BERT-BiLSTM to extract features from lengthy aquaculture texts and applied the N-Gram algorithm to segment features for integration into a cascading BiLSTM model, thereby resolving the issues of information loss and misclassification in long texts and bolstering the performance of joint entity–relation extraction.
Previous research in the aquatic domain has mainly focused on extracting information from lengthy texts, identifying nested entities, and conducting the joint extraction of entity relationships. However, the persistent problem of relationship overlap within the aquatic disease domain remains largely unaddressed. The prevalence of abundant overlapping triplets often leads to recognition errors and incomplete extractions, substantially diminishing the effectiveness of relationship extraction. Moreover, the specialized nature of the aquatic disease corpus itself poses stark challenges, namely, the presence of domain-specific disease terms and a plethora of drug names. This complexity makes it very hard for generic pretrained models to accurately capture semantic features, resulting in poor domain suitability. Compounding that difficulty, the imbalance in aquatic disease data impedes effective model learning. Lastly, during the extraction process, many features beyond the encoded vectors of sentences go underutilized, leading to both feature loss and an insufficiently nuanced semantic expression, which reduces model performance. Existing research methods have fallen short of soundly addressing these outstanding challenges. That said, the cascade tagging framework of the CasRel model does show promise in effectively identifying overlapping relationship triplets, offering a potential solution to the relationship overlap problem in aquatic disease data. Therefore, this paper adopts the cascading binary labeling framework as its foundational structure. Expanding on that, it integrates a fine-tuned pretrained model, a feature fusion module, and the GHM loss function to construct an entity–relationship joint extraction model tailored to the aquatic disease domain. The main contributions of this approach are summarized as follows:
- (1) A textual corpus pertaining to aquatic diseases was gathered, from which a dataset was compiled for aquatic disease relationship extraction. This dataset contains 33 relationship categories and comprises a total of 10,068 data entries.
- (2) The cascading binary labeling framework was employed to tackle the issue of relationship overlap, and the Roberta-wwm-ext pretrained model was adopted in place of the BERT pretrained model. The model was then fine-tuned with domain-specific knowledge related to aquatic diseases. This fine-tuning sought to enhance the model’s adaptation to the task of extracting relationships among aquatic disease entities, particularly in scenarios with limited data samples, thus improving its domain suitability.
- (3) To address the issue of data imbalance, the GHM loss function was incorporated into the model.
- (4) To further enhance the degree of feature fusion and enrich the semantic features of the model inputs, this paper introduces a feature fusion module named BRC into the architecture. The module combines a self-attention mechanism, a BiLSTM network, a multi-head attention mechanism with relative position encoding, and a conditional layer normalization (CLN) layer. It integrates the extracted sentence context features, head entity features, and head entity position features through the conditional layer normalization module, thereby enriching the semantic representation of the overall input and improving the model’s extraction capabilities. A minimal sketch of the conditional layer normalization component is given after this list.
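The paper does not publish its implementation; the following minimal PyTorch sketch shows one common way to realize a conditional layer normalization of this kind, in which the fused head-entity feature supplies the condition that rescales and shifts the normalized sentence features. Class and argument names, and the zero-initialization of the condition projections, are our assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer normalization whose gain and bias are generated from a condition
    vector (here: a fused head-entity feature), so the sentence representation
    is re-normalized conditioned on the tagged head entity."""

    def __init__(self, hidden_size: int, cond_size: int, eps: float = 1e-12):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))   # base gain
        self.bias = nn.Parameter(torch.zeros(hidden_size))    # base bias
        # Zero-initialized condition projections: the layer starts out as a
        # plain LayerNorm and learns the conditional modulation during training.
        self.cond_weight = nn.Linear(cond_size, hidden_size, bias=False)
        self.cond_bias = nn.Linear(cond_size, hidden_size, bias=False)
        nn.init.zeros_(self.cond_weight.weight)
        nn.init.zeros_(self.cond_bias.weight)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, hidden)  encoded sentence features
        # cond: (batch, cond_size)        fused head-entity feature
        cond = cond.unsqueeze(1)                     # broadcast over tokens
        gain = self.weight + self.cond_weight(cond)  # conditional gain
        bias = self.bias + self.cond_bias(cond)      # conditional bias
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        return gain * (x - mean) / torch.sqrt(var + self.eps) + bias
```

Zero-initializing the condition projections means the module behaves as an ordinary LayerNorm at the start of training and gradually learns how strongly the head entity should modulate the sentence representation.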
5. Experiments and Analysis
5.1. Experimental Dataset
In this paper, we mainly study the aquatic disease entity–relationship extraction model and apply it to knowledge graph construction, so the main dataset used is AquaticDiseaseRE. However, AquaticDiseaseRE is a Chinese-language dataset, so to further validate the model in the English domain, we additionally use the public English dataset WebNLG for comparative experiments. The training, validation, and test sets of the WebNLG dataset contain 5019, 500, and 703 sentences, respectively, and there is a certain amount of triple overlap, similar to the self-built AquaticDiseaseRE dataset.
5.2. Experimental Environment
The experiments described here were conducted on a system running Windows 11. It was equipped with an AMD Ryzen 7 5800H processor with Radeon Graphics at 3.20 GHz, along with 16 GB of RAM and an NVIDIA GeForce RTX 3060 graphics card. The software environment consisted of Python v3.9.13 and the PyTorch framework v1.10.1.
5.3. Model Parameters
We use different parameters to conduct experiments for different datasets. The hyperparameters for the experimental model of this study are given in Table 5.
5.4. Evaluation Metrics
For the task of relation extraction, precision (P), recall (R), and the F1 score were the chief evaluation metrics, as given by Equations (21)–(23):

P = TP / (TP + FP) (21)

R = TP / (TP + FN) (22)

F1 = (2 × P × R) / (P + R) (23)

where TP is the number of positive samples correctly predicted, FP is the number of negative samples incorrectly predicted as positive, FN is the number of positive samples incorrectly predicted as negative, and TN is the number of negative samples correctly predicted as negative.
5.5. Pre-Trained Model Replacement Experiment
To investigate the impact of various Chinese pretrained language model versions on the performance of the aquatic disease relationship extraction model when used as the encoding layer, and thereby select the most suitable pretrained model for in-task fine-tuning, we tested four distinct pretrained models: BERT, BERT-wwm, BERT-wwm-ext, and Roberta-wwm-ext. The results of this coding-layer substitution experiment are presented in Table 6.
Evidently, the Roberta-wwm-ext model achieves the highest precision (78.92%), recall (74.56%), and F1 score (76.68%). Compared to the BERT, BERT-wwm, and BERT-wwm-ext models, the F1 score of Roberta-wwm-ext shows an improvement of 5.78%, 4.46%, and 3.34%, respectively. This superior performance can be attributed to its use of a larger training dataset, a larger batch size, and a whole-word masking mechanism that is better suited to Chinese. Collectively, these factors give it an advantage over the other pretrained models. Therefore, based on the experimental results, this paper adopts Roberta-wwm-ext as the foundational model for fine-tuning the encoding layer; a sketch of such in-task fine-tuning is given below.
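The paper does not reproduce its fine-tuning recipe in code. As a hedged illustration, the snippet below sketches domain-adaptive masked-language-model fine-tuning of the hfl/chinese-roberta-wwm-ext checkpoint with HuggingFace Transformers. The corpus file name, sequence length, epochs, and batch size are placeholders, and plain token-level masking is used for brevity (the released model was pretrained with whole-word masking, which requires extra word-segmentation references).

```python
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# hfl/chinese-roberta-wwm-ext is distributed with a BERT architecture,
# so it is loaded through the Bert* classes.
tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertForMaskedLM.from_pretrained("hfl/chinese-roberta-wwm-ext")

# "aquatic_corpus.txt" is a hypothetical file of raw aquatic-disease text.
corpus = load_dataset("text", data_files={"train": "aquatic_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# Random 15% masking; the paper's exact masking strategy may differ.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fd-roberta-wwm-ext",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    data_collator=collator)
trainer.train()  # the fine-tuned encoder then replaces the coding layer
```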
5.6. Impact of Loss Functions on Unbalanced Data
In this study, to address the issue of data imbalance, we implemented the GHM loss function and compared its effectiveness against the Focal Loss and cross-entropy (CE) loss functions on the AquaticDiseaseRE and WebNLG datasets. The Focal Loss was parameterized with α = 0.5 and γ = 0.25. Figure 8 shows the outcomes of these comparisons.
Evidently, the GHM loss value exceeds that of both the CE and Focal Loss. This pronounced difference is due to GHM’s strategy of weighting the gradient of each sample, which ensures a more balanced contribution to the overall loss from samples of varying difficulty. The aquatic disease dataset in this study contains many difficult samples owing to its wide-ranging categories and their uneven distribution. Consequently, difficult-to-classify samples constitute a larger fraction of the dataset, increasing the overall loss for the model. Incorporating the GHM loss yields a notable enhancement of the model’s overall F1 score vis-à-vis both CE and Focal Loss. Specifically, on the test set, the model using the GHM loss achieves the best performance, showing an improvement of 2.77% and 4.35% over CE and Focal Loss, respectively. This suggests the GHM loss function effectively prioritizes difficult-to-classify samples and reduces the weights of easier ones so as to achieve a better balance of loss, thereby enhancing the model’s predictive accuracy for less represented categories. In contrast, the overall performance of Focal Loss diminishes relative to the CE loss function, likely because of Focal Loss’s excessive focus on hard-to-categorize samples; this overemphasis may cause the model to overfit those outliers.
The WebNLG dataset exhibits the same pattern as the AquaticDiseaseRE dataset because the two datasets share similar characteristics: both contain many relationship types and a large number of overlapping relation triples. In addition, the model’s loss fluctuates more on the WebNLG dataset than on AquaticDiseaseRE because WebNLG is relatively more complex, with 171 predefined relationship categories, a larger proportion of overlapping triples, and more unbalanced data. In general, GHM yields good results on datasets with limited and unbalanced samples; a minimal sketch of its weighting scheme follows this paragraph.
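For concreteness, here is a minimal sketch of the GHM-C weighting idea for the binary tagging losses used in this framework. It omits the exponential-moving-average smoothing of bin counts from the original formulation, and all names are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ghm_c_loss(logits: torch.Tensor, targets: torch.Tensor, bins: int = 10):
    """GHM-C sketch: weight each sample inversely to the density of its
    gradient norm, so that neither masses of easy samples nor a few very
    hard outliers dominate the loss (Focal Loss, by contrast, only
    up-weights hard samples and can overfit outliers)."""
    probs = torch.sigmoid(logits)
    g = (probs.detach() - targets).abs()  # gradient norm of BCE w.r.t. logit
    weights = torch.zeros_like(g)
    n = g.numel()
    edges = torch.linspace(0.0, 1.0, bins + 1, device=g.device)
    edges[-1] += 1e-6                     # make the last bin include g == 1
    valid_bins = 0
    for i in range(bins):
        in_bin = (g >= edges[i]) & (g < edges[i + 1])
        count = int(in_bin.sum())
        if count > 0:
            weights[in_bin] = n / count   # rarer gradient norms weigh more
            valid_bins += 1
    if valid_bins > 0:
        weights = weights / valid_bins
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * bce).sum() / n
```

In a cascade tagging model, a loss of this kind would replace the per-token binary cross-entropy applied to the head- and tail-pointer predictions.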
5.7. Comparative Experiment
To verify the effectiveness of the model proposed in this paper, we conducted comparative experiments with several other leading entity relationship extraction models on the AquaticDiseaseRE dataset and the WebNLG dataset:
- (1) NovelTagging [14]: A model designed for the joint extraction of entity relations by leveraging an innovative annotation strategy;
- (2) CopyR [16]: An end-to-end entity relationship extraction model that utilizes a copy mechanism;
- (3) GraphRel [22]: A relationship extraction model based on the structure of relation graphs;
- (4) TPLinker [18]: A labeling framework that utilizes the spans of entity heads and tails;
- (5) CasRel [15]: A pointer network model that employs cascading binary tags;
- (6) PFN [19]: An entity–relationship extraction model that uses partition filter networks;
- (7) UniRel [20]: An entity–relationship extraction model that integrates relational semantics.
According to Table 7, the model introduced in this paper exhibits impressive performance on the aquatic disease relationship extraction dataset. Specifically, its precision (P), recall (R), and F1 score show enhancements of 5.84%, 14.69%, and 8.66%, respectively, in comparison with the next best-performing model, PFN. Furthermore, the results for the test set are consistent with the F1 curve for the validation set. The challenge of handling many overlapping relationship triplets in this dataset renders NovelTagging less effective; it cannot properly address overlapping relationship extraction because it considers only scenarios where an entity is part of a single triple. Meanwhile, the CopyR and GraphRel series models, which utilize GRU or BiLSTM for encoding, fall short of the effectiveness of BERT pretrained-model encoding. Although the TPLinker, CasRel, PFN, and UniRel models achieve marked improvements, the pretrained models they employ are not sufficiently fine-tuned for specialized domains, constraining their gains relative to the model proposed in this paper. This outcome further emphasizes that, for highly specific domains, employing fine-tuned pretrained models is essential for augmenting the semantic encoding effect.
In addition, the model proposed in this paper also achieved the best performance on the public English dataset WebNLG, with P, R, and F1 values of 96.79%, 95.46%, and 96.12%, respectively, which are 1.91%, 0.83%, and 1.37% higher than those of the best-performing baseline, UniRel. Since WebNLG consists of general-domain rather than highly specialized text, fine-tuning the pretrained model on the data brings only a modest improvement there. These experimental results demonstrate that the model has potential cross-language application capabilities.
5.8. Experiment on Relationship Overlap
Besides improving relationship extraction, another critical issue that our model addresses is the problem of relationship overlap. To assess the model’s extraction capabilities under conditions of overlapping relationships, we categorized the test set data into two classes, Normal and Single Entity Overlap (SEO), based on the presence of overlapping relationships. The Normal category has 830 data points, while the SEO category has 196 data points (see Table 3). Figure 10 shows the experimental outcomes.
Evidently, the F1 scores of the model proposed in this study are superior to those of the other models for both the Normal and SEO categories. The latter models show a general reduction in performance across these two relationship extraction types, which suggests the task of extracting entity–relationship triplets gets harder as the proportion of shared entities in the dataset grows. Specifically, the NovelTagging, CopyR, and GraphRel models undergo a marked decline in F1 score when dealing with data that include relationship overlap, indicating their limitations in handling such data. By contrast, the remaining models actually exhibit slightly improved performance on overlapping relationship data, which we attribute to their respective architectures’ capability to address that key issue.
The Fd-CasBGRel model developed in this study has the best results for both Normal- and SEO-type relationship extraction tasks. It even records higher F1 scores in SEO- than Normal-type data, which could be due to the smaller sample size of SEO-type in the test set. Overall, these experimental outcomes provide compelling evidence that the model presented in this paper is more adept at resolving the challenge of relationship overlap in the task of aquatic disease relationship extraction.
To further validate the model’s extraction performance across scenarios with varying numbers of triplets, we categorized the data into five classes based on triplet count. These results are presented in Table 8.
A discernible trend emerges whereby the F1 scores of both the NovelTagging and CopyR models diminish to varying extents as the number of triplets per sentence increases. This trend suggests greater difficulty in processing semantic information and navigating a more complex answer space when more triplets are present, leading to potential confusion and errors during extraction. Additionally, the occurrence of overlapping relation triplets further complicates the extraction process. Conversely, models such as TPLinker, CasRel, PFN, and UniRel show significantly improved F1 scores when the triplet count is between 2 and 4. This better performance is ascribed to these models’ capacity to manage overlapping relational triplets, showcasing their strength within a specific triplet range. However, as a sentence’s triplet count continues to rise, the F1 scores of those models begin to decline to varying degrees.
The model we developed achieves the highest F1 score across all five triplet-count categories of the aquatic disease dataset, with a peak F1 score of 90.58% at N = 4. This score not only surpasses several comparative models but also exceeds the best-performing PFN model by 6.44%. This superior performance is credited to the cascading labeling framework employed in this study, which adeptly handles overlapping relationship data across multiple triplets; a minimal sketch of this tagging scheme is given after this paragraph. The integration of a fine-tuned pretrained model and a feature fusion module further enriches the sentence semantic expression and enhances extraction effectiveness. In particular, the incorporated GHM loss function balances the sample weights well and ameliorates the data imbalance in scenarios involving multiple triplets.
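To make the overlap-handling argument concrete, the sketch below illustrates the cascade binary tagging decomposition in PyTorch: a head-entity pointer pair followed by relation-specific tail pointer pairs conditioned on a candidate head. Because tails are tagged separately for every relation and every candidate head, a single head can yield several (relation, tail) pairs, and entities can be shared across triples. This is a simplified illustration under our own naming, not the paper's exact architecture (the paper conditions the tail taggers through the BRC module rather than by simple addition).

```python
import torch
import torch.nn as nn

class CascadeTagger(nn.Module):
    """CasRel-style cascade binary tagging: a head-entity tagger followed by
    relation-specific tail taggers conditioned on a candidate head entity."""

    def __init__(self, hidden: int, num_relations: int):
        super().__init__()
        self.head_start = nn.Linear(hidden, 1)
        self.head_end = nn.Linear(hidden, 1)
        # one start/end pointer pair per relation
        self.tail_start = nn.Linear(hidden, num_relations)
        self.tail_end = nn.Linear(hidden, num_relations)

    def forward(self, token_repr: torch.Tensor, head_repr: torch.Tensor):
        # token_repr: (batch, seq_len, hidden) encoded sentence
        # head_repr:  (batch, hidden) pooled feature of one candidate head
        hs = torch.sigmoid(self.head_start(token_repr)).squeeze(-1)
        he = torch.sigmoid(self.head_end(token_repr)).squeeze(-1)
        # condition token features on the candidate head (simple addition here;
        # the paper fuses them through the BRC module instead)
        fused = token_repr + head_repr.unsqueeze(1)
        ts = torch.sigmoid(self.tail_start(fused))  # (batch, seq_len, num_rel)
        te = torch.sigmoid(self.tail_end(fused))
        return hs, he, ts, te
```

Decoding thresholds each pointer distribution independently, which is why one sentence can emit multiple triples that share a head or tail entity.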
In summary, when assessed alongside several baseline comparative models, this study’s proposed model more effectively addresses the complexities of relationship overlap and multiple triplets in the aquatic disease domain.
5.9. Ablation Experiment
To evaluate the impact of the FD–Roberta-wwm-ext pretrained model, the BRC feature fusion module, and the GHM loss function on aquatic disease entity–relationship extraction, module ablation experiments were also carried out. The training outcomes are conveyed in Figure 11. Here, Casrel denotes the foundational framework, Casrel_BRC refers to the incorporation of only the feature fusion module, Casrel_GHM corresponds to the integration of solely the GHM loss function within the base framework, Casrel_FDRobertawwmext represents the sole addition of the fine-tuned pretrained model, and Fd-CasBGRel is the model proposed in this paper.
We can see from Figure 11 that the Casrel model, when augmented with the fine-tuned pretrained model, demonstrates significant improvements over the baseline Casrel model in terms of precision (P), recall (R), and F1 score, as well as in convergence speed. This enhancement can be attributed to the fine-tuned model’s ability to acquire domain-specific knowledge about aquatic diseases and encode the semantic features of those entities more effectively than the original pretrained model. Including the BRC feature fusion module alone leads to a notable increase in the model’s P value relative to the baseline, suggesting that feature fusion enriches the semantic vector representations, thereby augmenting the accuracy of entity recognition. The R value improves slightly, while the overall F1 score surpasses that of the baseline model. Not surprisingly, incorporating the GHM loss function also contributes performance gains to a certain extent by mitigating the data imbalance. Ablation experiments were also conducted on the test set, with these results presented in Table 9.
Incorporating the feature fusion module, GHM loss function, and fine-tuned pretraining model individually into the model yields F1 scores of 75.86%, 73.67%, and 82.95%, respectively. This corresponds to an enhancement of 4.96%, 2.77%, and 12.05% vis-à-vis the baseline model, indicating that each of the three augmentations is beneficial to the model’s performance. Notably, a greater improvement is gained from fine-tuning the pretraining model than from both the feature fusion module and GHM loss function. This can be ascribed to the pretraining model’s better ability to learn disease text data features through fine-tuning, which provides stronger domain adaptation and more effective encoding of semantic vectors. The adoption of the feature fusion module, with its ensuing 4.96% improvement, demonstrates that fusing the relative position features of the head entity’s first and last characters, local sentence features, and contextual features through CLN enables deeper feature integration and enriches semantic representations beyond the mere addition of head entity and sentence features. This confirms the efficacy of the BRC module. Moreover, integrating the GHM loss function results in a 2.77% improvement over the baseline model; this suggests the GHM loss function rectifies, to some extent, the issue of imbalanced aquatic disease data categories. When these three enhancements are applied in tandem, the model attains its peak F1 score of 84.71% for the aquatic disease dataset, indicating a synergistic effect that collectively boosts the model’s extraction capabilities.
In this paper, the inputs to the proposed feature fusion module are processed through the self-attention mechanism, a BiLSTM network, and a multi-head attention mechanism that incorporates relative entity-position encoding. To further assess the impact of these components, we conducted another ablation experiment as follows: BRC represents the model with the complete feature fusion module; -SA refers to the model with the self-attention mechanism removed; -BiLSTM denotes the removal of the BiLSTM network; and -RPE corresponds to the exclusion of the relative position encoding, relying solely on the head and tail word vectors of the head entity as conditional inputs for the CLN.
As evinced by Table 10, the removal of various components from the BRC module reduces the F1 score to varying degrees. Removing the self-attention mechanism results in a 1.66% decrease in the model’s F1 score, suggesting that the self-attention mechanism enhances the interaction between the BERT encoding layer and the BiLSTM layer, effectively modeling the distant dependencies within the sentence while emphasizing its local key features; this, in turn, strengthens the contextual representation of the BiLSTM encoding. Moreover, eliminating the BiLSTM leads to a significant drop in the F1 score of 3.82% compared to the baseline, highlighting the importance of contextual features for the task of relationship extraction, which requires parsing the sentence structure. Additionally, the F1 score decreases by 1.35% after removing the relative encoding of entity positions, indicating that incorporating the positional attributes of the head entity can somewhat improve the model’s ability to localize the head entity; this aids the extraction of relationships and tail entities, thereby boosting overall performance. Overall, the experimental results demonstrate that incorporating additional features, such as local key features, contextual sentence features, and the relative positional attributes of head entities, can effectively enhance the model’s performance. A simplified sketch of the head-entity relative-position feature is given below.
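The paper's exact formulation of relative position encoding inside the multi-head attention is not reproduced here; as an assumption-labeled illustration, the sketch below shows an additive-embedding variant of the idea ablated as -RPE, where each token's clipped signed distance to the head entity's first and last characters is embedded and added to the token representation.

```python
import torch
import torch.nn as nn

class HeadRelativePosition(nn.Module):
    """Sketch of a head-entity relative-position feature: embed each token's
    clipped signed distance to the head entity's start and end characters and
    add the embeddings to the token representation before tail tagging."""

    def __init__(self, hidden: int, max_dist: int = 64):
        super().__init__()
        self.max_dist = max_dist
        # distances are shifted into [0, 2*max_dist] to index the embeddings
        self.start_emb = nn.Embedding(2 * max_dist + 1, hidden)
        self.end_emb = nn.Embedding(2 * max_dist + 1, hidden)

    def forward(self, token_repr, head_start, head_end):
        # token_repr: (batch, seq_len, hidden); head_start/end: (batch,) indices
        batch, seq_len, _ = token_repr.shape
        pos = torch.arange(seq_len, device=token_repr.device).unsqueeze(0)
        d_start = (pos - head_start.unsqueeze(1)).clamp(-self.max_dist,
                                                        self.max_dist)
        d_end = (pos - head_end.unsqueeze(1)).clamp(-self.max_dist,
                                                    self.max_dist)
        return (token_repr
                + self.start_emb(d_start + self.max_dist)
                + self.end_emb(d_end + self.max_dist))
```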
5.10. Construction of the Knowledge Graph
Next, we use the Fd-CasBGRel model to extract triple data related to aquatic diseases and store it in a CSV file. By importing these triples into the Neo4j graph database using Neo4j’s import command, we can begin building a knowledge graph; a minimal sketch of this import step is given below. Statistical analysis reveals that this preliminary construction of the aquatic disease knowledge graph contains a total of 20,469 triples, covering 33 entity categories and 32 relationship categories.
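The paper states only that Neo4j's import command was used. As a hedged sketch, the snippet below loads a triples CSV through the official Neo4j Python driver with a LOAD CSV Cypher statement; the file name, column names, labels, and credentials are illustrative. Plain Cypher cannot set relationship types dynamically from data, so the relation name is stored as a property here.

```python
from neo4j import GraphDatabase

# Hypothetical CSV columns: head, relation, tail (one extracted triple per row).
IMPORT_TRIPLES = """
LOAD CSV WITH HEADERS FROM 'file:///aquatic_triples.csv' AS row
MERGE (h:Entity {name: row.head})
MERGE (t:Entity {name: row.tail})
MERGE (h)-[:RELATED_TO {type: row.relation}]->(t)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholder credentials
with driver.session() as session:
    session.run(IMPORT_TRIPLES)  # the CSV must sit in Neo4j's import directory
driver.close()
```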
Figure 12 shows a portion of the visualized knowledge graph for the aquatic disease domain.
6. Conclusions
This paper addresses the challenges of relationship overlap, corpus specialization, low feature fusion, and data imbalance in the task of extracting entity relationships for aquatic diseases, all of which diminish the efficacy of relationship extraction. To deal with these issues, we adopt a cascading binary labeling framework as the overarching architecture to enhance the extraction of overlapping relationship triplets and overcome the issue of overlapping relationships in aquatic diseases. Further, the pretrained model in the encoding layer is fine-tuned with a corpus related to aquatic diseases, augmenting the model’s capacity for the semantic encoding of aquatic disease texts. Building on this foundation, we introduce the BRC feature fusion module, which deeply integrates the relative position features of the head entity’s first and last characters, local sentence features, and contextual sentence features via a self-attention mechanism, a BiLSTM network, and a conditional layer normalization process. This module significantly strengthens the model’s feature fusion capabilities in the extraction task, enriches semantic representations, and thus amplifies the extraction effect. Additionally, the GHM loss function is implemented to help rectify the data imbalance issue. Finally, the refined model is used to successfully initiate the preliminary construction of a knowledge graph for aquatic diseases, evincing the model’s practical applicability and effectiveness in improving the extraction of aquatic disease entity relationships. The experimental findings demonstrate that the model introduced in this study achieves the highest F1 score of 84.71% on the aquatic disease entity–relationship extraction dataset. Notably, it attains an F1 score of 86.52% on the category of data involving overlapping entities, greatly outperforming several established mainstream models for entity–relationship extraction. This highlights the model’s comprehensive effectiveness in addressing the aforementioned challenges. In addition to the self-built aquatic disease dataset, comparative experiments were also conducted on the public English dataset WebNLG; the results show that the model achieves good performance there as well, demonstrating its applicability across languages.
While location features play a pivotal role, this study acknowledges that other features, both lexical and syntactic, remain underexploited. Investigating how to effectively mine and integrate them to bolster performance is now a chief focus of future research. Also, knowledge graphs’ application within the agricultural sector is relatively unexplored, with most attempts centered on question-and-answer scenarios. The potential for utilizing knowledge graphs in conjunction with real-time monitoring of environmental parameters and fish growth to drive early prediction and warning systems for aquatic diseases represents a promising research direction that warrants further investigation.