1. Introduction
Marine ranching, a cornerstone of China’s Blue Granary strategy, has emerged as a transformative approach to modernize marine fisheries, enhance aquaculture productivity, and promote ecological sustainability [
1]. The rapid development of related equipment, such as intelligent feeding systems, deep-sea cages, and multi-functional platforms, has significantly improved operational efficiency [
2,
However, despite the exponential growth of domain-specific knowledge, this knowledge remains fragmented, with critical information dispersed across heterogeneous sources including enterprise records, expert experience, academic literature, and technical standards. This fragmentation impedes intelligent decision-making, real-time monitoring, and knowledge sharing, thereby limiting the full potential of marine ranching industrialization.
A knowledge graph (KG) is a state-of-the-art semantic network paradigm; it employs graph structures to visualize relationships between entities, demonstrating advantages in intuitiveness, efficiency, and scalability [
4]. The concept of KG was first proposed by Google in 2012 and applied in the search engine domain [
Nowadays, knowledge graphs are classified into general knowledge graphs and vertical knowledge graphs. General knowledge graphs are not tied to a specific domain and place relatively low demands on knowledge accuracy; they emphasize breadth of knowledge and wide coverage. Examples include DBpedia [
6], Yago [
7], Freebase [
8], and Wikidata [
9]. Vertical knowledge graphs, on the other hand, are oriented towards a specific domain and emphasize the depth of knowledge. They have higher requirements for the professionalism and accuracy of knowledge. Examples include IMDB (Internet Movie Database) [
10], MusicBrainz [
11], and Chinese medical knowledge graphs. Recent advancements in vertical knowledge graph applications demonstrate their versatility in various industries. For example, in manufacturing, Ren et al. [
12] automated OPC UA (OLE for Process Control Unified Architecture) information modeling via KG to unify heterogeneous equipment data, while Gu et al. [
13] integrated geometric and assembly process data through a KG-based semantic model (KG-ASM). For agriculture, Wang et al. [
14] constructed a knowledge graph of agricultural engineering technology based on a large language model. In the broader marine domain, Chen et al. [
15] provide an overview of China’s policies on the development of marine ranching over the past two decades. Their study clarifies the current status, research hotspots, and future directions of marine ranching research. Additionally, Liu et al. [
16] established a knowledge graph construction and application framework for maritime accidents to facilitate the extraction and management of maritime knowledge from unstructured texts.
Despite these advancements, existing research predominantly focuses on structured data from product design or assembly processes, neglecting the unique challenges of marine equipment domains, where unstructured text dominates and entities exhibit complex interdependencies. Traditional extraction methods suffer from error propagation and inefficiency in handling such scenarios, while deep learning models like BERT-BiLSTM-CRF face limitations in parameter efficiency and contextual dependency modeling. In contrast to the aforementioned research, this article focuses on joint extraction methods for knowledge in the marine ranching equipment domain.
Through targeted questionnaires administered to diverse users and employees, several limitations of existing vertical KGs were identified: (1) limited data volume and knowledge scope; (2) ambiguity in the structure of the knowledge framework; (3) low efficiency and accuracy in knowledge extraction; (4) difficult updates and maintenance. Consequently, there is an urgent need for a specialized KG framework tailored to marine ranching equipment.
This study proposes the first structured KG framework for marine ranching equipment to bridge the gap between unstructured marine equipment data and knowledge. The main contributions of this study include:
- (1)
Hybrid Ontology Design: A combination of top-down and bottom-up approaches is used to construct a domain ontology, defining seven core concepts and eight semantic relationships.
- (2)
Joint Extraction Model: A BERT-BiGRU-CRF model that integrates BERT’s contextual embeddings, BiGRU’s parameter-efficient sequence modeling, and CRF’s global label optimization was developed. A novel TE + SE + Ri + BMESO tagging strategy resolves multi-relation extraction challenges.
- (3)
Dynamic Knowledge Storage: The extracted triples are stored in Neo4j, enabling scalable visualization and real-time updates via Cypher queries.
This work offers a transferable solution for vertical domains. By transforming fragmented data into structured knowledge, our framework supports intelligent applications including equipment fault diagnosis, maintenance planning, and policy formulation.
The remainder of this paper is structured as follows:
Section 2 details the hybrid KG construction methodology.
Section 3 describes the tagging strategy and BERT-BiGRU-CRF model.
Section 4 evaluates experimental results, and
Section 5 concludes with future directions.
2. Hybrid KG Construction Methodology
A KG is essentially a special semantic network composed of nodes and edges [
17], which connects different kinds of information into a relational network based on the connections between things.
The construction of a KG can be divided into three approaches: top-down, bottom-up, and a combination of the two. In the top-down approach, the ontology concept layer (i.e., the pattern layer) is constructed first to delimit the scope of knowledge extraction, and entities are then added to the knowledge base through graph construction techniques such as knowledge extraction. In the bottom-up approach, entities, relationships, and attributes with high confidence are extracted from data sources and added to the knowledge base, and concepts are then abstracted from the bottom up to complete the construction of the pattern layer. The combined method builds the pattern layer from the top down and then builds the data layer from the bottom up; through the induction and summarization of newly acquired data, entity expansion is realized based on the updated pattern layer [
4].
The marine ranching equipment KG is a vertical KG, which would typically adopt the top-down construction approach. However, as the amount of data grows, updating and maintaining the graph becomes increasingly difficult. Therefore, this study adopts a combination of the two methods to construct the marine ranching equipment KG, which supports a large data volume while ensuring high knowledge quality.
The process includes data acquisition and preprocessing, knowledge modeling, knowledge extraction, and data storage, which is shown in
Figure 1.
2.1. Data Acquisition and Preprocessing
The primary data sources for constructing the KG include related websites, enterprise production records, expert interviews, local and national standards, and relevant literature and publications. These data sources are categorized into semi-structured data and unstructured data. To ensure both the quality and quantity of the acquired knowledge, distinct acquisition and preprocessing methods are implemented for different data types.
For semi-structured data from websites, the most popular data acquisition technique is web crawling. After comprehensively evaluating popular web crawling frameworks, such as Scrapy (v2.12.0), PySpider, Crawley, and Portia, Scrapy was selected due to its advantages in stability, speed, scalability, modular structure, and low inter-module coupling [
18]. However, raw HTML documents often contain irrelevant content and redundant information, which may compromise the quality and efficiency of subsequent knowledge extraction. Therefore, preprocessing is essential to perform data cleaning and format standardization, ensuring the reliability of knowledge sources for graph construction. The specific workflow is illustrated in
Figure 2.
Step 1: Collect and analyze the related websites, and specify the crawling target.
Step 2: According to the website structure, write corresponding scripts to obtain raw HTML format documents.
Step 3: Use regular expressions to clean the HTML documents, removing advertisements, markup tags, and other noise.
Step 4: Write a format conversion script, combined with a degree of manual review (e.g., removing extra spaces and duplicate content), to produce JSON research documents in the {"key": "value", "key": [value]} format; a cleaning and conversion sketch is given below.
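The following minimal Python sketch illustrates Steps 3 and 4 under illustrative assumptions: the regular expressions, field names, and output path are placeholders rather than the exact scripts used in this study.

```python
# Illustrative sketch of Steps 3-4 (HTML cleaning and JSON conversion);
# regular expressions, field names, and the output path are placeholders.
import json
import re

def clean_html(raw_html: str) -> str:
    """Strip scripts, styles, and remaining tags, then collapse whitespace."""
    text = re.sub(r"<script.*?</script>|<style.*?</style>", "", raw_html, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text)       # drop remaining HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# One record in the {"key": "value", "key": [value]} format described above.
record = {
    "multi-functional platform": "Geng Hai No. 1",
    "aquaculture equipment": ["Automatic bait feeder (1 set)"],
    "introduction": clean_html("<p>Multi-functional marine ranching platform ...</p>"),
}

with open("cleaned_record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
```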
Unstructured data, such as enterprise data, expert interviews, relevant literature, and books, can be divided into electronic text data and paper text data. Electronic text data are obtained through text parsing, while paper text data are digitized by OCR text recognition. To facilitate unified downstream processing, the obtained data are cleaned and, combined with manual review, converted into JSON files in the same format as above.
The processed data contain no irrelevant content and follow fixed rules: a "value" with a single entry is stored as a string, and a "value" with multiple entries is stored as an array. Each record represents a type of marine ranching together with its associated attributes and attribute values.
2.2. Knowledge Modeling
Knowledge modeling is the foundation of KG construction and a prerequisite for building a complete and valuable KG. It effectively organizes the useful knowledge in massive amounts of information into a unified knowledge model that is convenient for computer processing [
19]. An ontology is a modeling tool for describing domain concepts, which helps ensure that the graph is well structured and low in redundancy. Therefore, this study adopts an ontology-based modeling method to build the pattern layer of the graph.
The graph in this study is a vertical KG with higher requirements for professional knowledge and accuracy; therefore, the ontology is constructed manually in a top-down manner. At present, common manual ontology construction methods include the seven-step method [
20], the skeleton method [
21], the METHONTOLOGY method [
22], etc. The seven-step method is currently the most widely used. It is an iterative ontology modeling method, mainly used for constructing domain ontologies, whose advantages are its detailed step-by-step description and strong operability. The steps are as follows: determine the domain and scope of the ontology; consider whether existing ontologies can be reused; list the key terms of the ontology; define the classes and the class hierarchy; define the attributes of the classes; define the facets of the attributes; create instances. In ontology design there is no absolutely correct construction method, only the method most suitable for a given application scenario. Drawing on ontology construction methods from other domains and on the application scenario of the marine ranching equipment domain, this study optimized the existing seven-step method and obtained the construction process of the marine ranching equipment domain ontology, as shown in
Figure 3.
2.2.1. Determine the Domain of the Ontology
Protégé [
23] is an open-source, Java-based ontology editing tool developed by the Stanford Center for Biomedical Informatics Research. When using Protégé to build an ontology, the primary task is to determine its domain; that is, to clarify what domain the ontology covers, what its purpose is, in which scenarios it will be applied, and how it will be maintained. In this study, the ontology covers the domain of marine ranching equipment and is mainly used for the construction of the KG. The data in the ontology will be used for intelligent retrieval and question answering. The main maintenance approach is to update classes, relationships, and attributes based on the induction and summarization of new data. In addition, listing the key concepts and terms in the domain gives users and builders a clearer understanding of the entire ontology database. Some of the key concepts and terms are shown in
Table 1.
2.2.2. Determine the Structure and Related Elements
The concepts, attributes, and relations of the marine ranching equipment ontology must be designed comprehensively. According to domain investigation and relevant papers, marine ranching equipment can be divided into five equipment modules: multi-functional platforms, deep-sea cages, marine ranching observatories, sea fishing boats, and engineering vehicles [
1]. Based on these five equipment modules and the content and characteristics of the existing data, the remaining four parent concepts are determined: various types of equipment, marine design criteria, positioning methodology, and principal dimensions. The various-types-of-equipment concept includes seven sub-concepts, such as security equipment, energy equipment, aquaculture equipment, and navigation equipment. To further increase the amount of graph data and reflect the ongoing trend toward intelligent marine ranching, this study enriches the ontology with two additional parent concepts: marine ranching equipment knowledge and marine ranching construction. The marine ranching equipment knowledge concept mainly covers knowledge of existing marine ranching demonstration areas and introductions to the various types of marine ranching. The marine ranching construction concept refers to local standards, including construction norms for monitoring and evaluation, layout, and distribution.
Figure 4 shows the structure.
Since each concept has distinct features and associated data, its attributes must be defined accordingly.
Table 2 lists some attributes of the marine ranching equipment ontology.
In the above ontology structure, in addition to the hierarchical relations between concepts, there are also semantic relations among the entities contained in the sub-concepts. For example, there is a use_condition relationship between the multi-functional platform ("Geng Hai No. 1") and the marine design criteria ("designed water depth (10 m)"), and an aquaculture relationship between the deep-sea cage ("Jing Hai No. 1") and the aquaculture equipment ("Automatic bait feeder (1 set)"); the category of each relationship corresponds to its range.
Table 3 shows the semantic relationships of the marine ranching equipment ontology, based on the concepts designed in the previous section.
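As an informal illustration of how these concepts and relations can be organized for downstream processing, the Python sketch below mirrors part of Table 3; only concepts and relations explicitly mentioned in the text are listed, and the dictionary layout itself is an assumption for demonstration.

```python
# Illustrative in-memory representation of part of the ontology schema
# (concepts plus semantic relations with their domain and range).
ontology_schema = {
    "concepts": [
        "multi-functional platforms", "deep-sea cages", "marine ranching observatory",
        "sea fishing boats", "engineering vehicles", "marine design criteria",
        "positioning methodology", "principal dimensions", "aquaculture equipment",
    ],
    "relations": {
        # relation name: (domain concept, range concept)
        "use_condition": ("marine ranching equipment module", "marine design criteria"),
        "aquaculture": ("marine ranching equipment module", "aquaculture equipment"),
    },
}
```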
2.2.3. Marine Ranching Equipment Ontology Construction
Finally, the ontology modeling tool Protégé was used to build the marine ranching equipment ontology. The concepts, related elements, and some instances defined above were added to complete the knowledge modeling of the graph, as shown in
Figure 5.
2.3. Knowledge Extraction
Knowledge extraction aims to extract entity–attribute–attribute value triples from different types of acquired data, so as to provide necessary knowledge for the construction of the KG. It is divided into three categories: entity, relation, and attribute extraction. Entity extraction is the most basic and key step in knowledge extraction. Deep learning-based methods are currently the most popular in the domain of entity extraction. Common deep learning models include BiGRU-CRF [
24], BiLSTM-CRF [
25]. In this study, marine ranching equipment entities need to be identified in the domain text, such as "Geng Hai No. 1" (a marine ranching equipment module instance) and "Equipment-based Marine Ranching type" (a marine ranching type instance). Relation extraction extracts the relationships between entities on the basis of entity extraction. Attribute extraction extracts entity attribute information, such as the attributes of "Equipment-based Marine Ranching type" and "Geng Hai No. 1", and generally treats an attribute as a relationship between an entity and its attribute value for extraction [
26].
The data to be extracted include semi-structured and unstructured data, and the extraction methods differ by data type. For semi-structured data, rule-based scripting is used to jointly extract entities and attributes. For unstructured data, a deep learning model is used to jointly extract entities and relations. The specific extraction model is described in
Section 3.
2.4. Knowledge Storage
Knowledge storage should take the application scenario and data scale into account and choose an appropriate storage mode for the structured knowledge, enabling efficient data management and analysis. According to the storage structure, knowledge storage can currently be divided into two kinds: table-based and graph-based. After comprehensive analysis, this study adopts graph-based knowledge storage.
Presently, the predominant graph database systems encompass HyperGraphDB, OrientDB, and Neo4j. Neo4j [
27] is the most popular among them; it can store and query the entities, attributes, and relationships in a KG and supports applications that operate on and analyze the KG. Neo4j is classified as a property graph database and is fundamentally structured around four core components: labels, nodes, relations, and attributes. The role and described object of each component are delineated in
Table 4. In contrast to alternative graph database systems, Neo4j boasts superior scalability, the capacity to accommodate millions of nodes on standard hardware configurations, the availability of the Cypher query language, and compatibility with a multitude of popular programming languages. Consequently, Neo4j has been chosen as the database platform for the storage and maintenance of the graph structure in the present study.
Based on the py2neo library in Python (v3.9.13), scripts were written to batch-import triples such as (entity, relationship, entity) and (entity, attribute, attribute value); a sketch of this import is given below. The resulting Neo4j-based KG encapsulates 2153 nodes and 3872 edges.
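A minimal sketch of such a batch import with py2neo follows; the connection URI, credentials, node label, and the use of MERGE to avoid duplicates are illustrative assumptions, not the exact import script of this study.

```python
# Hedged sketch of batch triple import with py2neo; URI, credentials, and the
# "Entity" label are placeholders.
from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

triples = [
    ("Geng Hai No. 1", "use_condition", "designed water depth (10 m)"),
    ("Jing Hai No. 1", "aquaculture", "Automatic bait feeder (1 set)"),
]

for head, relation, tail in triples:
    # MERGE-style creation avoids duplicate nodes when the script is re-run.
    h = Node("Entity", name=head)
    t = Node("Entity", name=tail)
    graph.merge(h, "Entity", "name")
    graph.merge(t, "Entity", "name")
    graph.merge(Relationship(h, relation, t))
```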
Figure 6 shows partial content.
In
Figure 6, the nodes are distinguished by different colors corresponding to different conceptual instances, while the edges interlinking these nodes represent the relationships between them. Owing to the openness of the KG and the good scalability of the Neo4j database, the KG established in this research can be systematically enriched and augmented via Cypher query language statements; a minimal example follows below. This will, in turn, provide a robust foundation for subsequent applications in equipment fault diagnosis, maintenance planning, and policy formulation.
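For instance, a new triple could be merged into the graph with a Cypher statement along the following lines (executed here through py2neo; the node label and property names are illustrative):

```python
# Enriching the graph with a new triple via a Cypher MERGE statement.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
graph.run(
    "MERGE (e:Entity {name: $head}) "
    "MERGE (v:Entity {name: $tail}) "
    "MERGE (e)-[:use_condition]->(v)",
    head="Geng Hai No. 1", tail="designed water depth (10 m)",
)
```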
3. Joint Extraction of Knowledge in the Domain of Marine Ranching Equipment
Knowledge extraction aims to extract entity–attribute–attribute value triples from the different types of acquired data, so as to provide the necessary knowledge for KG construction. As mentioned above, rule-based joint extraction is used for semi-structured data, while a deep learning model is used for joint extraction from unstructured data.
3.1. Rule-Based Joint Extraction of Entity Attributes
The dataset analyzed in the preceding section is semi-structured data adhering to specific rules. For these data entries, the initial key–value pair enclosed within each set of curly braces denotes the category to which the entity pertains, as well as the entity's name. Subsequent key–value pairs are stored in the format "attribute": "attribute value", with each pair pertaining to the entity identified in the initial key–value pair. Empirical validation has confirmed that this structured rule facilitates the extraction of entity–attribute–attribute value triples (e.g., marine ranching equipment entity–attribute 1–attribute value 1; marine ranching equipment entity–attribute 2–attribute value 2; …; marine ranching equipment entity–attribute n–attribute value n), as sketched below.
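The following minimal sketch illustrates this rule; the input file name and the flat record layout are assumptions for illustration rather than the exact scripts used in this study.

```python
# Rule-based entity-attribute-attribute value extraction from preprocessed
# JSON records; the file name and record layout are illustrative.
import json

def record_to_triples(record: dict) -> list[tuple[str, str, str]]:
    items = list(record.items())
    category, entity = items[0]          # first key-value pair: category and entity name
    triples = []
    for attribute, value in items[1:]:   # remaining pairs: "attribute": "attribute value"
        values = value if isinstance(value, list) else [value]
        for v in values:                 # multi-valued attributes yield one triple per value
            triples.append((entity, attribute, str(v)))
    return triples

with open("marine_ranching_equipment.json", encoding="utf-8") as f:
    records = json.load(f)               # assumed to be a list of records

all_triples = [t for rec in records for t in record_to_triples(rec)]
```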
In order to enhance the presentation and utility of the KG, the current study introduces a method to normalize multi-valued attributes into entities. A segment of the KG is illustrated in
Figure 7.
3.2. Joint Entity Relation Extraction Based on Deep Learning Models
3.2.1. The Innovative TE + SE + Ri + BMESO Tagging Strategy
Upon examining the previously processed JSON dataset, it was found that the "value" fields contain a substantial amount of unstructured text, which in turn harbors numerous implicit interconnections among entities. For example, within the "value" of the "introduction" (profile) attribute of "Jing Hai No. 1", there are intricate entity relationships pertaining to principal dimensions, aquaculture equipment, and marine design criteria.
Based on a comprehensive analysis of the marine ranching equipment corpus, in conjunction with the relations defined in the pattern layer, several distinctive characteristics were identified: (1) the extraction tasks are uniformly centered on the marine ranching equipment module concept, so the marine ranching equipment module entity is designated as the theme entity of the extracted triples; (2) the relation between the marine ranching equipment module entity and another entity is consistent with the category of that other entity, so identifying the entity type also determines the relationship; (3) a single sentence may contain multiple relationships between the marine ranching equipment module entity and several other entities.
Drawing upon the preceding analytical insights, the current study introduces an innovative tagging strategy, designated TE + SE + Ri + BMESO, which is specifically tailored to the marine ranching equipment corpus. The study employs the BERT-BiGRU-CRF entity extraction model to concurrently identify entities and extract inter-entity relationships. Within this tagging schema, the marine ranching equipment module entity is denoted as the theme entity, represented by TE. Entities that interact with the marine ranching equipment module entity are denoted by SE_Ri, with SE denoting the secondary entity and Ri indicating the category of the i-th secondary entity SEi, which corresponds to the relationship type linking the theme entity to SEi. The BMESO sequence labeling approach is utilized, with the detailed connotations of each label delineated in
Table 5.
For example, in the input sequence “Jing Hai Yi Hao Pei Bei Xi Wang Ji”, in this sentence, “Jing” serves as the beginning of the theme entity, corresponding to Tag “B-TE”, “Hai Yi” is in the middle of the theme entity, corresponding to Tag “M-TE”, “Hao” is at the end of the theme entity, corresponding to Tag “E-TE”, “Pei Bei” serves a connecting role and has no specific meaning, so it corresponds to Tag “O”. “Xi”, “Wang”, “Ji”, respectively, correspond to the beginning, middle and ending of the secondary entity. Similarly, they can be marked as “B-SE_Ri, M-SE_Ri, E-SE_Ri”.
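To make the scheme concrete, the short Python sketch below reproduces the character-level tags for this example sentence; the helper function and its name are purely illustrative.

```python
# Character-level TE + SE + Ri + BMESO tagging for the example sentence above.
def bmeso_tags(length: int, role: str) -> list[str]:
    """Return BMESO tags for a span of the given length and role (e.g. 'TE', 'SE_Ri')."""
    if length == 1:
        return [f"S-{role}"]
    return [f"B-{role}"] + [f"M-{role}"] * (length - 2) + [f"E-{role}"]

# "Jing Hai Yi Hao Pei Bei Xi Wang Ji": theme entity (4 chars), filler (2 chars),
# secondary entity of relation Ri (3 chars).
chars = ["Jing", "Hai", "Yi", "Hao", "Pei", "Bei", "Xi", "Wang", "Ji"]
tags = bmeso_tags(4, "TE") + ["O", "O"] + bmeso_tags(3, "SE_Ri")
for c, t in zip(chars, tags):
    print(f"{c}\t{t}")
```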
3.2.2. The Specific Structure and Working Principle of the BERT-BiGRU-CRF Model
The BERT model [
18] is a widely adopted pre-trained language model in natural language processing (NLP), demonstrating exceptional performance in text representation and semantic understanding. The BERT-BiGRU-CRF architecture comprises three layers: a BERT layer, a Bidirectional Gated Recurrent Unit (BiGRU) layer, and a Conditional Random Field (CRF) layer. The overall model is shown in
Figure 8.
Compared with the traditional BiLSTM-CRF model, this study adopts BiGRU to reduce the parameter count (by 18%) and uses the CRF layer to explicitly model label dependencies (e.g., between 'B-TE' and 'E-SE_Ri'), alleviating the problem of nested marine equipment entities.
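The following PyTorch sketch outlines one possible implementation of this architecture; the Hugging Face transformers and pytorch-crf libraries, the bert-base-chinese checkpoint, and the hidden size are assumptions for illustration rather than the exact configuration reported in this paper.

```python
# Illustrative sketch of a BERT-BiGRU-CRF tagger (library choices are assumptions).
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf


class BertBiGRUCRF(nn.Module):
    def __init__(self, num_tags: int, hidden_size: int = 128,
                 bert_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)            # contextual embeddings
        self.bigru = nn.GRU(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)   # sequence modeling
        self.fc = nn.Linear(2 * hidden_size, num_tags)              # emission scores
        self.crf = CRF(num_tags, batch_first=True)                  # global label optimization

    def forward(self, input_ids, attention_mask, tags=None):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        gru_out, _ = self.bigru(embeddings)
        emissions = self.fc(gru_out)
        mask = attention_mask.bool()
        if tags is not None:                      # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequence
```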
- (1)
BERT layer
BERT is a context-based word embedding model and its structure is delineated in
Figure 9.
The execution of the BERT layer predominantly encompasses two pivotal components: the representation of the input data and the pre-training procedures. The representation of the input data refers to transforming the data into a format compatible with BERT's input requirements. Each character in the input is represented as the sum of its token embedding, segment embedding, and position embedding, as shown in
Figure 10.
BERT is pre-trained based on two major tasks: “Masked Language Model” (MLM) and “Next Sentence Prediction” (NSP). Through the simultaneous training of these two tasks, it can better extract word-level and sentence-level features of the text, obtaining token embeddings that contain more semantic information.
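As a brief illustration of how such contextual embeddings can be obtained in practice, the snippet below uses the Hugging Face transformers library with an assumed Chinese BERT checkpoint and an English placeholder sentence; it is not the exact pre-training setup described above.

```python
# Obtaining BERT contextual embeddings for a sentence (illustrative checkpoint).
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# The tokenizer produces token ids, segment (token type) ids, and an attention
# mask; position embeddings are added inside the model and summed with them.
inputs = tokenizer("Jing Hai No. 1 is equipped with an automatic bait feeder",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```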
- (2)
BiGRU layer
To capture contextual information in both directions, the current study employs a Bidirectional Gated Recurrent Unit (BiGRU) network, whose fundamental building block consists of a forward and a backward GRU. The detailed model is delineated in
Figure 11.
In Figure 11, the variable x_t denotes the input at the current time step, h_t denotes the output at the current time step, and h_{t−1} denotes the output at the previous time step. The reset gate r_t and the update gate z_t operate together to regulate the previous hidden state h_{t−1} and its transition into the new hidden state h_t. The reset gate combines h_{t−1} and x_t to produce a matrix r_t whose elements range from 0 to 1. Following the standard GRU formulation, with σ denoting the sigmoid function:

r_t = σ(W_r·[h_{t−1}, x_t] + b_r)

The update gate combines h_{t−1} and x_t to control how much information from the previous step's output h_{t−1} is retained:

z_t = σ(W_z·[h_{t−1}, x_t] + b_z)

The candidate hidden state h̃_t consists of two components: the current input x_t and the previous output h_{t−1}, the latter modulated by the reset gate r_t:

h̃_t = tanh(W_h·[r_t ⊙ h_{t−1}, x_t] + b_h)

The matrices W_r, W_z, and W_h denote the weight matrices, b_r, b_z, and b_h represent the biases, and ⊙ denotes element-wise multiplication. The final output is determined by the update gate:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
- (3)
CRF layer
In named entity recognition, interdependencies exist among the labels. For example, the label "B-TE" can never be directly followed by the label "M-SE_AQ". Nonetheless, the BiGRU model simply selects the label with the highest probability as the predicted outcome, without considering such inter-label constraints. To address this, the current study introduces a CRF (Conditional Random Field) layer. During label prediction, the CRF layer evaluates both the score of each individual label and the transition probabilities learned from the training corpus, effectively reducing the likelihood of illegal label sequences and enhancing the precision of the predictions.
Suppose the input sequence is X = {x_1, x_2, x_3, …, x_n} and the corresponding output label sequence is y = {y_1, y_2, y_3, …, y_n}. The score of a label sequence is calculated as:

s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

In the formula, A is a transition matrix of size (k + 2) × (k + 2), where A_{i,j} represents the score of transitioning from label i to label j; the two extra dimensions account for the start and end labels. P is the output matrix of the BiGRU layer, with a size of n × k, where n indicates the sentence length and k represents the number of labels; P_{i,j} denotes the score of the i-th word being assigned the j-th label. To derive the probabilities corresponding to all potential label sequence scores, the softmax function is employed:

p(y | X) = exp(s(X, y)) / Σ_{ỹ ∈ Y_X} exp(s(X, ỹ))

Here, Y_X denotes the set of all feasible label sequences for the input sequence X, and ỹ denotes one such candidate sequence. Taking the logarithm of both sides yields the log-likelihood used for training, and the Viterbi algorithm is applied during decoding to identify the sequence with the highest score:

y* = argmax_{ỹ ∈ Y_X} s(X, ỹ)
4. Results and Analysis
4.1. System Testing Environment
In the context of this investigative endeavor, the experimental procedures were executed utilizing the Python and PyTorch frameworks. The corresponding software and hardware configuration is delineated in
Table 6.
This investigation assesses the efficacy of the model using three standard performance metrics in knowledge extraction, namely precision (P), recall (R), and the F1 score.
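These metrics are computed in the standard way, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2 × P × R / (P + R)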
4.2. Experimental Data and Parameters
In this study, the dataset consists of 1456 annotated sentences specifically related to marine ranching equipment. The dataset was split into training, validation, and test sets in an 8:1:1 ratio, yielding a training corpus of 1164 sentences, a validation corpus of 146 sentences, and a test corpus of 146 sentences. After a series of parameter-tuning runs, the optimal parameter settings are outlined in
Table 7.
4.3. Experimental Results
To substantiate the superiority of the model developed in this research, a comparative analysis was conducted against the prevalent knowledge extraction models BiLSTM-CRF and BERT-BiLSTM-CRF. For enhanced stability in the assessment, each extraction experiment was performed 10 times to determine average values and variances. The detailed outcomes of these experiments are delineated in
Table 8.
4.4. Experimental Analysis
4.4.1. Comparison with Models
As illustrated in
Table 8, the experimental results demonstrated superior performance over the BiLSTM-CRF and BERT-BiLSTM-CRF models, achieving 86.58% precision, 77.82% recall, and an 81.97% F1 score. Specifically, compared with the BiLSTM-CRF model, the precision, recall, and F1 score increased by 9.38%, 9.37%, and 9.45%, respectively, because BERT provides deep semantic representations and mitigates the problems of polysemy and data sparsity. Compared with the BERT-BiLSTM-CRF model, the precision, recall, and F1 score increased by 1.21%, 2.50%, and 1.94%, respectively. This indicates that, in contrast to the BiLSTM model, the BiGRU model has a reduced parameter count, which not only enhances model performance but also accelerates training. The utilization of BiGRU as the encoding layer is therefore more appropriate for the text-based entity recognition task specific to marine ranching equipment.
4.4.2. Different Entity Recognition Results
In order to deepen our comprehension of the joint extraction of knowledge in the domain of marine ranching equipment, we performed a comprehensive evaluation of the BERT-BiGRU-CRF architecture’s efficacy in identifying diverse entity categories. The recognition efficacy pertaining to diverse entities is graphically shown in
Figure 12.
An analysis of
Figure 12 and associated data reveals that the F1 score for the marine ranching equipment module entity is the highest at 92.17%. Notably, entities such as marine design criteria, positioning methodology, and principal dimensions exhibit considerable F1 scores at 87.12%, 88.81%, and 89.85%, respectively. The reason might be that these entities are relatively few in number and their grammatical structures are complete and clear, thus making them less difficult to identify.
The presence of diverse nomenclature for various equipment entities, such as "batch feeder", "bait dispenser", and "GQ48902", all referring to the "feeding machine" in the context of aquaculture equipment, leads to more unrecognized entities and a relatively low recall rate for these entity types, which in turn lowers the overall recall and F1 score of the model.
This study is the first to construct an entity recognition task within the domain of marine ranching. Currently, there are no comparable research results in the same domain to make relevant comparisons. In the related marine domain, Lv et al. [
28] proposed an improved YOLO v5 target detection algorithm; its experimental results show that the values of mAP and F1 of the improved YOLO v5 target detection algorithm are 72.1% and 0.722, respectively, which are better than other target detection algorithms in terms of accuracy and reliability. Cao et al. [
29] proposed a method combining a neural network with a statistical model (BiLSTM-CRF) to identify marine drugs, achieving an accuracy rate, recall rate, and F1 score of 72.23%, 66.76%, and 68.57%, respectively. The F1 score of 81.97% obtained in this study is substantially higher than these reported levels, indicating that the proposed method has reached a practical level.
However, the method and the dataset also have some limitations. First, the current corpus contains only 1456 labeled sentences, and multilingual data need to be added. Furthermore, the knowledge graph does not yet integrate real-time data streams; we plan to optimize this by combining the incremental learning capability of the graph database.
4.5. Intelligent Question-Answering System for Marine Ranching Equipment
Based on the above knowledge graph, this study designed an intelligent question-answering system. The question-answering page of the system, shown in
Figure 13, includes a search box and a result box. Users enter a question in the search box and click the search button; based on the marine ranching equipment knowledge graph and the question-answering algorithm, the corresponding retrieval results appear in the result box. To enhance the convenience of the system, this module also provides a question recommendation function, which is refreshed in real time to suggest questions users may want to ask.
For a user-input question, the question is first segmented and part-of-speech tagged using the Jieba library together with a custom entity dictionary, and the main entity in the question is extracted by script. The question is then classified with the BERT model to determine its semantics. Finally, the graph is queried through a Cypher query template to obtain the answer, as sketched below.
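A simplified sketch of this pipeline is given below; the user dictionary path, the part-of-speech flag used to pick the entity, the Cypher template, and the stubbed question classifier are all illustrative assumptions rather than the deployed system.

```python
# Hedged sketch of the question-answering pipeline: Jieba segmentation with a
# custom entity dictionary, a stubbed intent classifier, and a Cypher template.
import jieba
import jieba.posseg as pseg
from py2neo import Graph

jieba.load_userdict("marine_ranching_entities.txt")   # custom entity dictionary (placeholder)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER_TEMPLATES = {
    "attribute_query": "MATCH (e {{name: '{entity}'}})-[r:`{attribute}`]->(v) RETURN v.name",
}

def answer(question: str) -> list:
    words = [(w.word, w.flag) for w in pseg.cut(question)]        # segmentation + POS tagging
    entity = next((w for w, flag in words if flag == "nz"), None)  # pick the domain entity
    intent = "attribute_query"          # placeholder for the BERT-based question classifier
    cypher = CYPHER_TEMPLATES[intent].format(entity=entity, attribute="aquaculture")
    return [record[0] for record in graph.run(cypher)]
```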
5. Conclusions and Prospects
In essence, this study not only proposes the first structured KG framework for marine ranching equipment but also offers a transferable methodology for vertical domain knowledge extraction, yielding successful outcomes. The present study employs a hierarchical, top-down methodology to establish the model layer of marine ranching equipment, culminating in the formulation of an ontological framework for the marine ranching equipment KG. Subsequently, a bottom-up method is enacted to develop the corresponding data layer, facilitating the comprehensive acquisition of marine ranching equipment data and the subsequent extraction. Thereafter, the BERT-BiGRU-CRF model is used to accomplish the joint extraction of entity relationships within the marine ranching equipment domain. Ultimately, the graph data are stored within the Neo4j database. The conclusions are summarized as follows:
The Neo4j-based KG encapsulated 2153 nodes and 3872 edges, enabling scalable visualization and dynamic updates. Experimental results demonstrated superior performance over BiLSTM-CRF and BERT-BiLSTM-CRF, achieving 86.58% precision, 77.82% recall, and 81.97% F1 score.
A comparison with existing knowledge graphs in the marine domain shows that they mainly focus on aspects such as maritime accident analysis, whereas this study focuses on extracting the entity relationships of marine ranching equipment. Marine ranching equipment is an important prerequisite for developing deep-sea aquaculture, yet its domain text is grammatically complex and suffers from fragmentation and information islands. This research fills the gap in this vertical domain.
Currently, there is still room for improvement in this study, and it will be further deepened and expanded from the following aspects:
First, because the marine ranching equipment domain has developed over a relatively short period, its domain corpora are not yet abundant. The training data of this article mostly come from the Internet, the manuals of related companies, and assorted documents. We are currently collaborating with enterprise partners to expand multilingual corpora (such as English technical manuals) and discussing increases in data volume through the fusion of multiple data modalities.
Furthermore, it is imperative to address the dynamic and continuous evolution of marine ranching equipment, which is characterized by rapid advancement and an ever-expanding body of knowledge. The capacity to capture data in real-time and facilitate the dynamic updating of the marine ranching equipment KG represents a pivotal challenge for future endeavors.
Finally, in recent years, language models based on large-scale corpora and pre-training techniques have become a research hotspot in natural language processing, for instance, the GPT series of models proposed by OpenAI and the BERT model developed by Google; BERT in particular has achieved outstanding results in various NLP tasks, such as sentence classification and intelligent question answering. The method in this study can be transferred to other vertical domains, and future work will focus on integrating it with large language models for such application scenarios.