A Novel Approach for the Analysis of Ship Pollution Accidents Using Knowledge Graph

Abstract: Ship pollution accidents can cause serious harm to marine ecosystems and economic development. This study proposes a knowledge-graph-based method for analyzing ship pollution accidents, addressing the difficulty of presenting complex accident information clearly. Based on information from 411 ship pollution accidents along the coast of China, the Word2vec word vector model and the BERT-BiLSTM-CRF and BiLSTM-CRF models were applied to extract entities and relations, and the Neo4j graph database was used for knowledge graph storage and visualization. Furthermore, case information retrieval and cause correlation of ship pollution accidents were analyzed through the knowledge graph. The method established 3928 valid entities and 5793 valid relationships, with extraction accuracies of 79.45% for entities and 82.47% for relationships. In addition, through visualization and Cypher queries, the logical relationships between accidents and their causes can be understood clearly and relevant information retrieved quickly. Using centrality algorithms, the degree of influence among accident causes can be analyzed and targeted measures proposed based on the relevant causes, which helps improve accident prevention and emergency response capabilities and strengthens marine environmental protection.


Context and Motivations
With the development of the global economy, international trade has become increasingly frequent, and maritime transport plays a vital role in the trade system. Because most of the world's goods are transported by sea, ship accidents occur frequently, increasing the risk of ship pollution accidents [1]. Ship traffic accidents not only pose a significant threat to human life and property but also cause severe environmental pollution [2]. It is therefore necessary to analyze issues related to ship pollution accidents: analyzing accident data plays an important role in improving the capacity to handle ship pollution accidents.
At present, there are various studies related to ship pollution accidents. However, there are still some shortcomings in the research on data analysis and processing of ship pollution accidents.
The construction of a knowledge graph of ship pollution accidents can help governments, maritime administrations, and rescue agencies respond to ship pollution accidents more quickly and accurately; integrate and analyze historical event data; provide real-time accident information, pollution scope, and impact assessments; and guide emergency rescue and pollution cleanup work. At the same time, it helps relevant departments strengthen prevention and monitoring, identify potential risk factors and trends by analyzing the causal data in the knowledge graph, and take preventive measures in advance to reduce the probability of pollution incidents. In addition, the knowledge graph of ship pollution accidents can provide comprehensive data support for governments and decision-makers, helping them formulate more scientific and effective management policies and regulatory measures.
The paper is divided into five parts. The first part introduces the research status of knowledge graphs and ship pollution accidents. The second part analyzes the basic concepts, construction process, and knowledge storage methods of knowledge graphs. The third part proposes a design framework for the knowledge graph of ship pollution accidents, including the model layer, data layer, and management application layer. The fourth part studies the construction process of a knowledge graph for marine ship pollution accidents, including text vectorization, entity and relationship extraction, the experimental environment, and model parameter settings, and visualizes the complete pollution accident knowledge graph. The fifth part uses the Cypher language for case retrieval analysis and, through the application of centrality algorithms, effectively identifies and explores the inherent logic of ship pollution accidents.

Ship Pollution Incidents
Deng, J. et al., conducted a statistical analysis of accidents in Chinese waters and identified collision and grounding as the primary causes of marine ship pollution accidents [3]. Chen, J. et al., suggested that exploring the causality of accidents is essential for correcting and reminding future ship navigation behaviors to reduce the incidence of pollution accidents [4]. Heij, C. et al., indicated that human factors, including decision-making errors, resource management deficiencies, non-compliance, lack of skills, and communication errors, are the most significant contributors to marine ship pollution accidents [5]. In the marine management field, Zhang, W. et al., proposed that analyzing pollution accident causes through databases and drawing lessons could provide valuable insights for marine transportation safety management [6]. Wang, Z. et al., used Bayesian networks to construct a dynamic risk assessment method for offshore platform systems [7]. Han, M. et al., used the NVNWAA method to assess accident risks of ship oil spills, offshore oil platform spills, and submarine pipeline spills [8]. Sevgili, C. et al., analyzed and predicted the importance of factors influencing tanker accident oil spills using a data-driven Bayesian network (BN) learning model, employing the K2 algorithm and the expectation-maximization (EM) algorithm for BN structure and parameter learning [9]. Jing, S. et al., investigated the interrelationships among accident attribute factors and predicted the frequency of oil spill accidents using direct calculation and fault tree analysis (FTA) [10]. Tang, C. et al., established a Bayesian network model consisting of six root nodes and three intermediate nodes to identify the possibility of potential water pollution risks and derive the sensitive causes of pollution accidents [11]. Hou, D. et al., proposed a generalized form of the EP-risk model for river pollution accidents based on a Monte Carlo simulation, the analytic hierarchy process (AHP), and the risk matrix method, which was used for uncertainty analysis of pollutant transport in rivers [12]. Zheng, H. et al., established a three-dimensional hydrodynamic and water quality model for Danjiangkou Reservoir and put forward scientific suggestions for prevention and emergency treatment based on the simulation results [13]. Dong, L. et al., used one-dimensional and MIKE 21 convection-diffusion models to simulate the chemical transport process in coastal rivers and nearshore waters to identify the most dangerous pollution sources and the most vulnerable receptors in the nearshore waters of the North Sea [14]. Huang, D. et al., used the weighted association rule mining (WARM) method to study the correlation between the characteristics of marine traffic accidents [15]. Sathish, T. et al., used a deep-learning-based DCNN to detect and analyze pollution in coastal areas [16].

Knowledge Graphs
Knowledge graphs provide a structured way to describe real-world entities, phenomena, and their relationships, making them an excellent method for knowledge management [17]. Common platforms, such as Wikidata, Google's Knowledge Graph, and YAGO, facilitate knowledge graph management and application [18]. Knowledge extraction is the core of knowledge graph construction [19], referring to the extraction of structured knowledge from various data sources and structures [20]. Knowledge extraction comprises two main tasks: entity extraction and relationship extraction [21]. Entity extraction, also known as named entity recognition (NER), involves extracting entities from unstructured text using specific methods; here, entities refer to objectively existing, distinguishable objects [22]. Relationship extraction refers to identifying and obtaining semantic relationships from unstructured text, where relationships denote abstract connections between entities. Relationship extraction is also known as multi-relationship extraction (MRE), primarily extracting multiple types of relationships from single-text data [23]. Entities and relationships are stored together as triplets to form a knowledge graph; triplets are the carriers of structured knowledge, consisting of a head entity, a relationship, and a tail entity [24].
In terms of knowledge graph construction and application, Liu, J. et al., used knowledge graph technology to explore railway operation accidents [25]. Liu, C. et al., constructed a knowledge graph of marine pollution using a transformer-based bidirectional encoder combined with a multi-convolutional neural network model [26]. Gan, L. et al., analyzed ship collision accidents using a knowledge graph and studied the composition and relevance of factors leading to collision accidents [27]. Xie, C. et al., focused on the construction of a spatiotemporal knowledge graph for ship activities [28]. Yan, J. et al., proposed a water quality prediction model knowledge graph that combines weighted CNN-LSTM with adversarial learning to predict total nitrogen values [29]. Liu, X. et al., proposed a knowledge graph-based oil spill reasoning method that combines rule inference and graph neural network technology, which pre-infers and eliminates most non-oil spills using statistical rules to alleviate the problem of imbalanced data categories (oil slick and non-oil slick) [30]. Wan, H. et al., combined the characteristics of ship violations and proposed a method of constructing a ship knowledge graph through scientific knowledge graph technology to improve ship management and service capabilities [31]. In related work, Wan, H. et al., used scientific knowledge graph technology to construct a ship knowledge graph for analyzing ship violations and improving the refined management and service capabilities of ships [32]. Zhang, L. et al., constructed a knowledge graph for river knowledge visualization, which can be used to quickly and intelligently identify sedimentary phase types from literature data or directly on-site [33]. Therefore, knowledge graphs can intuitively display key information about marine pollution accidents, while extraction results can provide rapid and professional decision information services for the construction of marine pollution prevention and control capabilities.

Basic Concepts
A knowledge graph, also known as a scientific knowledge graph, is a series of graphical representations that display the development process and structural relationships of knowledge. It utilizes visualization techniques to describe knowledge resources and their carriers, and to explore, analyze, construct, visualize, and display knowledge and its interconnections. In logical terms, knowledge graphs can be divided into two layers: the data layer and the pattern layer. The pattern layer, as the core of the knowledge graph, is positioned above the data layer. It primarily employs ontologies to standardize a series of factual expressions in the data layer, corresponding to actual data specifications and terminology expressions. The data layer mainly consists of a set of fact-based knowledge, forming the ontology of the knowledge graph. By constructing the ontology into the triplet forms of "entity-relationship-entity" and "entity-property-property value," and storing it as the organizational form of the ontology in a graph database, a vast network of entity relationships is formed, thereby constructing the knowledge graph.
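As a minimal, self-contained illustration (the entity and relation names below are invented for this sketch, not taken from the accident corpus), the two triplet forms described above can be represented as simple tuples:

```python
from collections import namedtuple

# A triplet is the basic carrier of structured knowledge:
# a head entity, a relationship (or property), and a tail entity (or property value).
Triplet = namedtuple("Triplet", ["head", "relation", "tail"])

# "entity-relationship-entity" form
t1 = Triplet("Accident_A", "involves_ship", "Vessel_X")
# "entity-property-property value" form
t2 = Triplet("Vessel_X", "gross_tonnage", "4500 t")

print(t1.head, t1.relation, t1.tail)
```

Linking many such triplets by shared entities is what forms the network of entity relationships stored in the graph database.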

Construction Process
Based on unstructured, semi-structured, and structured data, knowledge is extracted and stored through manual, automatic, or semi-automatic processing methods, ultimately constructing a knowledge graph. The construction process requires continuous updating and iteration as knowledge grows. Each iteration mainly consists of three stages: knowledge extraction, knowledge fusion, and knowledge processing. The construction process of the knowledge graph is illustrated in Figure 1. Knowledge extraction refers to using methods such as entity extraction and relationship extraction to extract structured knowledge (entities, relationships, and attributes) from unstructured or semi-structured data. Knowledge fusion involves de-duplicating the structured knowledge obtained from knowledge extraction, mainly through entity disambiguation and coreference resolution. Knowledge processing refers to updating and improving the ontology library as knowledge grows and is applied in practice in the knowledge graph.
There are two main approaches to constructing knowledge graphs: top-down and bottom-up. The top-down approach first creates the ontology and data layer pattern of the knowledge graph at the top level and then sequentially stores entities and relationships in the knowledge base. The bottom-up approach extracts representation patterns of knowledge from open data sources using technical means, selects those with high credibility for storage in the knowledge base, and constructs the top-level ontology pattern accordingly. Given the unstructured data and complex knowledge associations in marine ship pollution accident data, this paper adopts a bottom-up approach to construct the knowledge graph of marine ship pollution accidents.

Knowledge Storage Method
In order to achieve optimal storage efficiency and query speed for the vast amount of complex, loosely structured, and highly interconnected knowledge in knowledge graphs, and after a comprehensive analysis of the characteristics of knowledge in the field of marine ship pollution accidents and the application scenarios of knowledge graphs, this paper utilizes the Neo4j graph database to store knowledge related to marine ship pollution accidents. Graph database storage is a type of NoSQL database storage, which fundamentally means storing and querying data using a "graph" as the data structure. Graph databases abstract graphs into essential elements, such as nodes and edges, and store them in graph data structures. Graph databases have application advantages over other databases when dealing with large amounts of loosely structured, complex, interconnected, frequently changing data that often requires query operations. Table 1 provides an overview of the Neo4j database, including the company producing it, its application scenarios, whether it is open source, whether it has visualization tools, supported operating systems, and other details.
As shown in Table 1, the Neo4j graph database can store tens of billions of entities and relationships, has visualization capabilities, and ranks highly in the number and quality of searchable documents. Released in 2010, Neo4j is a typical graph database that stores structured graph data. It employs a network-based storage approach rather than traditional tabular storage, offering complete database functionality. Additionally, it supports rich graph computation features, serving as an embedded, high-performance, highly available, and lightweight graph computing tool. The main purpose of constructing the knowledge graph of marine ship pollution accidents in this paper is to support subsequent applications. Given the frequent querying and updating of entities and relationships in the knowledge base, the unique data storage approach of the Neo4j graph database facilitates high-frequency read-write operations.
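A minimal sketch of how extracted triplets might be loaded into Neo4j by generating Cypher `MERGE` statements; the `Entity` label and the relationship types are hypothetical examples, and in practice each statement would be executed through the official Neo4j Python driver (e.g., `session.run`):

```python
# Sketch: turn (head, relation, tail) triplets into Cypher MERGE statements.
# MERGE (rather than CREATE) avoids duplicate nodes and edges when the same
# entity appears in many accident reports.
def triplet_to_cypher(head, relation, tail):
    """Build a Cypher statement that merges two nodes and the edge between them."""
    return (
        f"MERGE (h:Entity {{name: '{head}'}}) "
        f"MERGE (t:Entity {{name: '{tail}'}}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )

triplets = [
    ("Accident_A", "CAUSED_BY", "Collision"),
    ("Accident_A", "INVOLVES_SHIP", "Vessel_X"),
]
statements = [triplet_to_cypher(*t) for t in triplets]
for s in statements:
    print(s)
```

For real accident text, parameterized queries (passing `head`/`tail` as query parameters) would be preferable to string interpolation, both for safety and for query-plan caching.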

Design Framework of Knowledge Graph for Marine Ship Pollution Accidents
The knowledge graph design framework for the field of marine ship pollution accidents constructed in this paper is shown in Figure 2. The design framework comprises three levels: the pattern layer, the data layer, and the management application layer. The data layer primarily handles structured, unstructured, and semi-structured data related to marine ship pollution accidents, serving as the foundational corpus of the knowledge graph and preparing for subsequent knowledge extraction tasks. The model layer extracts entities and relationships of the corresponding types with reference to the entity definitions, links them according to their relationships to form a structured knowledge network, and stores it in the Neo4j graph database. The management application layer represents the application domain of the knowledge graph, allowing it to be used for matching, retrieving, or querying relevant data and knowledge related to marine ship pollution accidents. Based on the visualization and structured expertise provided by the knowledge graph, it allows for information retrieval and assists in analyzing and guiding on-site personnel in studying causality.

Pattern Layer
The pattern layer is the model structure of the knowledge graph, which describes the conceptual model of the various entities, attributes, and relationships within the domain. Based on 411 reports of marine ship pollution accidents, the ontological types were extracted and designed.
Among them, the pollution accident case name is used as the parent class, comprising five core concepts: accident type (operational or accidental), involved ship, accident cause, accident loss, and accident level. These five core concepts give the constructed ontology library a clear hierarchical structure. Table 2 describes the relationships between entities and between entities and attributes in the core concepts of ship pollution accidents. The relationship descriptions explain the logical connections between entities and between entities and attributes, which helps us understand the connections between accidents.

Based on rule-based matching, the structure of entities and relationships generally aligns with the text and is often presented in tabular form as basic ship information. The table is typically titled "Ship Basic Data" and includes information such as ship name, port of registry, ship type, hull material, gross tonnage, cargo capacity, etc. When constructing the data layer, the original text structure is mainly adopted as a feature to describe the information of the involved ships, allowing for the effective and accurate extraction of ship-related attributes and entities. Specific entities and attributes in the ship overview include ship name, port of registry, gross tonnage, number of main engines, main engine power, overall length, breadth, cargo capacity, ship type, hull material, shipyard, shipowner, etc.

Data Layer
In the knowledge graph, the pattern layer is usually placed above the data layer. The data layer contains actual data, such as entities, relationships, and attributes. The pattern layer defines the pattern information, such as the structure, relationships, and constraints between entities in the data layer, as well as the methods of organizing, querying, and analyzing data. The pattern layer can be regarded as an abstraction and description of the data layer: it provides a high-level understanding of, and operations on, the data, so that the data can be understood and used more effectively. Based on the entities and relationships obtained from the core concepts defined in the pattern layer, the data layer of the knowledge graph is constructed. This paper selects 411 marine ship pollution accident reports, with an average text length of 514 characters per unprocessed report, and performs preprocessing, entity extraction, and relationship extraction on them. Figure 3 illustrates the resulting data layer.

Management Application Layer
The completed knowledge graph has many applications, such as information query, cause analysis, decision support, and graph visualization. Through graph visualization, different information about an accident can be clearly displayed and saved, which facilitates accident information query and decision support. At the same time, when similar incidents occur, relevant causes can be found to speed up the judicial process, thereby simplifying the marine accident investigation process.
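As a sketch of the information-query application, the following composes a parameterized Cypher query for retrieving accidents linked to a given cause; the `Accident` and `Cause` labels and the `CAUSED_BY` relationship are illustrative stand-ins, not the paper's actual schema:

```python
# Sketch: build a parameterized Cypher query for case retrieval.
# The query string would be executed against Neo4j via the official
# Python driver (session.run(query, **params)); here we only compose it.
def build_cause_query(cause_name):
    query = (
        "MATCH (a:Accident)-[:CAUSED_BY]->(c:Cause {name: $cause}) "
        "RETURN a.name"
    )
    params = {"cause": cause_name}
    return query, params

query, params = build_cause_query("Collision")
print(query)
print(params)
```

Passing the cause as a `$cause` parameter rather than interpolating it into the string lets Neo4j cache the query plan across repeated retrievals.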

Data Source Validation
When building the data layer, we need to prevent derived entities and relationships from introducing inconsistent or inaccurate information into the knowledge graph generated after annotation. First, we verify the data sources to ensure that those used are credible and accurate. Ship pollution accidents involve multiple data sources, such as ship tracking data and maritime accident reports; verifying the reliability and accuracy of these sources is the first step in validating derived entities and relationships. We then resolve problems with derived entities and relationships that are difficult for machine learning to discover through manual review. Finally, after our review is completed, professional personnel review the data again to ensure that any inconsistent or inaccurate information is corrected.

Entity Recognition Implementation Based on Rule Matching
This study uses rule-based matching methods to extract information from semi-structured text and establishes rule standards to annotate entities and relationships. These rules may involve ship operation rules, environmental regulations, etc., to ensure that they are consistent with professional knowledge and practice in the field of ship pollution accidents. First, we used the entities and relationships to be extracted as rule templates, covering the physical attributes of the ship, such as ship name, registered port, gross tonnage, number of main engines, main engine power, length, breadth, and cargo capacity, as well as other entities such as ship type, hull material, shipyard, and shipowner. Next, using a bottom-up approach, we summarized synonyms of the proprietary terms mentioned above from the text and constructed a synonym list applicable to the field of ship pollution accidents, as shown in Table 3. We then used the Python language to read the text corpus and used commas and spaces as separators to convert the content on both sides into a dictionary; when a match succeeded, we extracted the corresponding value as a knowledge element. Taking Table 4 as an example, the results shown there were extracted from the basic ship data through rule matching and include attributes such as vessel name, registered port, length, breadth, vessel type, and owner.
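The matching step described above can be sketched as follows; the field names and the small synonym table are illustrative stand-ins for the domain synonym list of Table 3:

```python
# Sketch of rule-based field extraction from semi-structured "Ship Basic Data":
# split a line on commas, split each field on a colon, normalize the key
# through a synonym list, and keep matched values as knowledge elements.
SYNONYMS = {
    "ship name": "vessel_name", "vessel name": "vessel_name",
    "port of registry": "registered_port", "registered port": "registered_port",
    "gross tonnage": "gross_tonnage", "total tonnage": "gross_tonnage",
}

def extract_fields(line):
    record = {}
    for field in line.split(","):
        if ":" not in field:
            continue  # skip fragments that are not key-value pairs
        key, value = field.split(":", 1)
        norm = SYNONYMS.get(key.strip().lower())
        if norm:  # a rule matched -> keep the value as a knowledge element
            record[norm] = value.strip()
    return record

row = "Ship name: Ocean Star, Port of registry: Ningbo, Total tonnage: 4500"
print(extract_fields(row))
# -> {'vessel_name': 'Ocean Star', 'registered_port': 'Ningbo', 'gross_tonnage': '4500'}
```

The synonym lookup is what makes "Total tonnage" and "Gross tonnage" land in the same attribute slot, mirroring the bottom-up synonym list construction described in the text.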

Text Vectorization
After annotating the defined data tags, in order for the computer to deeply understand the information in the text, the annotated data needs to be converted into a word vector form that the computer can recognize. The Word2vec word vector model solves this problem well. Word2vec contains two learning models: the skip-gram model and the CBOW model. The skip-gram model predicts the surrounding words based on the target word. The CBOW model is the opposite: it predicts the central target word from the words surrounding it in the sentence sequence. Both models have their advantages. The skip-gram model is suitable for processing longer text, while the CBOW model is better suited to short text.
(1) Skip-Gram Model. The skip-gram model diagram is shown in Figure 4. The skip-gram model maps each word to a low-dimensional real vector space by learning word embedding vectors. These vectors capture the semantic similarity between words, making words with similar semantics closer in the vector space, so that words can be semantically compared and operated on in the vector space. For example, given the input word $w_t$ and a sliding window of size $c$, the output is the word vectors of the context of $w_t$, namely $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$. The model maximizes the average log probability

$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$.

(2) CBOW Model. The CBOW model mainly predicts the central word through the surrounding context words. The model diagram is shown in Figure 5. If the window size is set to five, the central word $w_t$ is predicted from the four context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$; that is, the model maximizes

$\frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$.
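The two training objectives can be illustrated by generating the (input, target) pairs each model trains on; the tokens and window sizes below are arbitrary examples:

```python
# Skip-gram pairs each center word with every context word inside a
# symmetric window of c words on each side; CBOW pairs the full context
# with the center word it must predict.
def skipgram_pairs(tokens, c=1):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (input word, predicted context word)
    return pairs

def cbow_pairs(tokens, c=2):
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - c), min(len(tokens), i + c + 1))
                   if j != i]
        pairs.append((tuple(context), center))  # (context words, predicted center)
    return pairs

toks = ["oil", "spill", "pollutes", "coastal", "waters"]
print(skipgram_pairs(toks, c=1)[:3])
# -> [('oil', 'spill'), ('spill', 'oil'), ('spill', 'pollutes')]
```

In an actual Word2vec implementation these pairs feed a shallow network whose hidden weights become the word embeddings; only the pair-generation step is shown here.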

Entity Extraction Based on BERT-BiLSTM-CRF
The BERT-BiLSTM-CRF model is an end-to-end model that can be trained directly on labeled data and can directly output the final entity label sequence during prediction, avoiding the multi-stage training and prediction process of traditional models and simplifying model design and use. This paper combines BERT with the traditional named entity recognition BiLSTM (Bidirectional Long Short-Term Memory)-CRF (Conditional Random Field) model to extract entities. The model mainly consists of three parts: the BERT layer, the BiLSTM layer, and the CRF layer. First, the unstructured text of the marine ship pollution accident is input into the BERT model to generate dynamic word vectors that better express the text features. Then, the pre-trained word feature vectors are used as input, and bidirectional training is performed in the BiLSTM model to deeply learn the full-text feature information. An attention mechanism is introduced to score the context information of each part according to its degree of attention, giving a higher weight to the focal parts and highlighting the features that play a key role in entity recognition. Finally, the CRF algorithm is used to decode and optimize the output results, avoiding unreasonable labels and obtaining the globally optimal sequence; each entity is then extracted and classified to complete the task of identifying entities related to pollution accidents. The overall framework is shown in Figure 6. (1) BERT Layer. BERT, as a pre-trained language representation model in natural language processing, can learn the relationships between sentences and between words, and assign weights based on the calculated relationships to obtain feature vectors with contextual information. The model introduces self-attention mechanisms, enabling it to extract meaningful information from the surrounding text during the training phase. The structure of the BERT pre-trained model is shown in
Figure 7. In the initial stage, the accident text is first converted into word vectors through the BERT pre-trained model. The [CLS] tag is used to represent the whole-sentence classification, and the [SEP] tag is used to separate two independent sentences. Here, $E_1, E_2, \cdots, E_N$ are the input word vectors of the model, and after passing through multiple encoding transformers (Trm), the output is the word vector sequence $T_1, T_2, \cdots, T_N$. The role of the self-attention mechanism is to obtain features from different aspects. It involves three important quantities, namely, Query, Key, and Value. The principle is to use a query variable, Query, to find the information most essential to the query from a large amount of information, Key, and apply it to Value while suppressing other useless information. At the same time, we aim to avoid potential biases or limitations when using the BERT model for entity extraction. We have made specific adjustments to the BERT model to improve its performance on ship pollution accidents, and combined it with the BiLSTM-CRF model to enhance the accuracy and robustness of entity extraction.
(2) BiLSTM Layer. LSTM can capture longer-range information, and its bidirectional variant can also achieve two-way semantic capture to obtain more comprehensive semantic information. The input and output of the LSTM are controlled by three gates, namely, the forget gate, input gate, and output gate. Its unit structure is shown in Figure 8. The hidden-layer output of a long short-term memory (LSTM) network is calculated as follows:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \otimes \tanh(c_t)$

In the formulas, $W$ represents the weight matrix connecting the two layers; $b$ represents the bias vector; $\sigma$ represents the sigmoid activation function; $\otimes$ represents element-wise multiplication; $x_t$ represents the input vector at time $t$; $f_t$, $i_t$, and $o_t$ represent the forget gate, input gate, and output gate at time $t$, respectively; $c_t$ represents the cell state at time $t$; and $h_t$ represents the output at time $t$. Considering that the LSTM network cannot encode information from back to front, researchers proposed splicing forward and backward LSTM networks together to form a BiLSTM network. The word vector sequence of the BiLSTM network is processed by the forward and backward layers to form a hidden state sequence with context information. Compared with LSTM, BiLSTM achieves bidirectional semantic capture and obtains more comprehensive semantic details, thereby improving the accuracy of entity recognition; this advantage has made BiLSTM popular in entity recognition tasks.
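A single scalar LSTM step implementing the gate equations above can be sketched as follows; the shared weight and bias values are arbitrary illustrative numbers, not trained parameters (a real layer uses separate weight matrices per gate over vector inputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One scalar LSTM step: forget gate, input gate, candidate state,
# cell-state update, output gate, and hidden output, in that order.
def lstm_step(x_t, h_prev, c_prev, w=0.5, b=0.1):
    f_t = sigmoid(w * x_t + w * h_prev + b)      # forget gate
    i_t = sigmoid(w * x_t + w * h_prev + b)      # input gate
    c_hat = math.tanh(w * x_t + w * h_prev + b)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # new cell state
    o_t = sigmoid(w * x_t + w * h_prev + b)      # output gate
    h_t = o_t * math.tanh(c_t)                   # hidden output
    return h_t, c_t

h, c = lstm_step(1.0, 0.0, 0.0)
print(round(h, 4), round(c, 4))
```

A BiLSTM runs one such recurrence left-to-right and another right-to-left over the sequence and concatenates the two hidden outputs at each position.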
(3) CRF Layer. The best tagging path among all possible tagging paths is decoded in the CRF layer to constrain the final sequence labeling. The CRF layer takes the score matrix $P$ output by the BiLSTM layer as input, where the element $P_{i,j}$ is the score of the $j$-th tag for the $i$-th word in the sentence. In addition, the CRF layer introduces a tag transition matrix $A$, in which $A_{i,j}$ represents the transition score from tag $i$ to tag $j$ for consecutive words, and $A_{0,i}$ represents the initial score of starting from tag $i$. For an input sentence $X$ and an output sequence $y = (y_1, \cdots, y_i, \cdots, y_n)$, its predicted score $s(X, y)$ is calculated as follows:

$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$

In this equation, $A_{y_i, y_{i+1}}$ represents the score of transferring from label $y_i$ to label $y_{i+1}$ in the transition matrix, $P_{i, y_i}$ refers to the score of the $i$-th word being labeled $y_i$, and $s(X, y)$ represents the calculated score of the label sequence; the output sequence is the one with the highest final score.

Relationship Extraction Based on BiLSTM-CRF
The BiLSTM-CRF relationship extraction model combines the advantages of sequence modeling and label dependency modeling, achieving good performance in relationship extraction tasks. Therefore, this paper adopts the BiLSTM-CRF model for extracting ship pollution accident events. The BiLSTM consists of forward and backward LSTM networks, effectively addressing the issue of one-way information propagation in traditional LSTM models. Combining CRF on top of the BiLSTM model adds constraints to the final label predictions to ensure their reasonableness and improve the accuracy of the prediction results. The BiLSTM-CRF model is divided into four layers: the data layer, the vector representation layer, the BiLSTM layer, and the CRF layer. The specific model structure is shown in Figure 9. In this paper, we used BERT-BiLSTM-CRF and BiLSTM-CRF for knowledge extraction. During the extraction process, the models make several assumptions. First, when a sentence is input, the words and phrases related to its meaning should be consistent with the preceding and following sentences, so that the model can obtain more information from the surrounding context and better understand the meaning of a phrase. The entities and relationships extracted by the model are closely related to the surrounding text; therefore, the model considers the overall vocabulary when extracting phrases, not just local vocabulary information. Second, we assume that entities and relationships are embedded in a way that captures their semantic similarities and connections, so that the model can incorporate these data into the recognition process. Under these assumptions, the models can achieve good performance in construction and training, which supports the feasibility and reliability of our research results.

Experimental Environment
The experimental environment of this study was the Windows 10 64-bit operating system, and the open-source Python machine learning framework PyTorch was used to train the model. The specific configuration is shown in Table 5. The model employs 200 hidden units for both the forward and backward LSTM layers, applies dropout with a rate of 0.5 to the input and output layers of the BiLSTM network to prevent overfitting, uses a batch size of 2 and 120 epochs during training, and utilizes the Adam optimizer with an initial learning rate of 0.01, as detailed in Table 6. In Equations (9)-(11), accuracy is used to assess the model's ability to predict labels correctly, recall evaluates the model's ability to retrieve all types of entities or relations, and the F1 score assesses the stability of the model.
TP_i (true positive) is the number of correctly identified entity or relationship labels of category i, and FP_i (false positive) is the number of incorrectly identified entity or relationship labels of category i, where i denotes the label category of the defined entity or relationship. N represents the total number of recognized entity and relationship labels. FN_i (false negative) is the number of unrecognized entity or relationship labels of category i, and n is the total number of categories of entity or relationship labels. Averaged over categories, precision P = (1/n) Σ_i TP_i / (TP_i + FP_i), recall R = (1/n) Σ_i TP_i / (TP_i + FN_i), and the F1 score is F1 = 2PR / (P + R). Precision, recall, and the F1 score serve as evaluation indicators for models in entity extraction and relationship extraction tasks.
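As a minimal illustration of these evaluation metrics, the following sketch computes macro-averaged precision, recall, and F1 from gold and predicted label sequences; the label names are hypothetical examples, not the study's actual tag set.

```python
from collections import Counter

def macro_prf1(gold, pred):
    """Macro-averaged precision, recall and F1 over label categories."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label p, but it was wrong
            fn[g] += 1   # gold label g was missed
    labels = set(gold) | set(pred)
    prec = {l: tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0 for l in labels}
    rec  = {l: tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0 for l in labels}
    P = sum(prec.values()) / len(labels)
    R = sum(rec.values()) / len(labels)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

gold = ["Ship", "Ship", "Cause", "Cause"]
pred = ["Ship", "Cause", "Cause", "Cause"]
P, R, F1 = macro_prf1(gold, pred)
```

Note that F1 is computed here from the macro-averaged P and R; averaging per-category F1 scores instead is an equally common convention and gives slightly different values.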
The results of entity and relation extraction are shown in Table 7. Under the same experimental environment and data, the BiLSTM model achieved precision, recall, and F1 scores of 61.58%, 62.45%, and 60.81%, respectively, while the BERT-BiLSTM model obtained 72.48%, 75.59%, and 73.51%. The precision, recall, and F1 scores of the BiLSTM network embedded with the BERT pre-trained language model were all superior to those of the original BiLSTM model, as the BERT pre-trained word vector model has stronger feature extraction capabilities, enabling the model to represent words better and learn feature information from the text more thoroughly. Compared to the BiLSTM model, the BERT-BiLSTM-CRF model with the addition of the CRF network showed further improvements in precision, recall, and F1, reaching 82.47%, 83.69%, and 85.48%, respectively, with increments of 20.89%, 21.24%, and 24.67%. The main reason for this improvement is that the conditional random field can utilize rich internal and contextual information to achieve a globally optimal labeling. The table shows that this study achieves a high accuracy rate in entity and relationship extraction, which mainly depends on the rigorous cleaning and labeling of the ship pollution accident data used in constructing the knowledge graph. We ensured the accuracy and consistency of the data and removed erroneous or inaccurate information to avoid introducing noise into the knowledge graph. At the same time, we optimized the BERT-BiLSTM-CRF extraction algorithm to improve the accuracy of entity and relationship extraction. Finally, we applied manual review and error correction mechanisms during knowledge graph construction; through expert review and correction, erroneous information in the knowledge graph can be corrected, thereby improving its quality and application effectiveness.
We evaluated the loss values of the models during training to ensure that the extraction model with the best performance and generalization ability was selected. The loss curves of the three models, BiLSTM, BERT-BiLSTM, and BERT-BiLSTM-CRF, gradually stabilized after 60 training epochs, with loss values approaching zero. As the number of training iterations increased, the BERT-BiLSTM-CRF model's loss decreased the fastest and it performed best, as shown in Figure 10.

Knowledge Graph Visualization
A knowledge graph of marine ship pollution incidents was successfully constructed and stored in the Neo4j graph database, as shown in Figure 11. The graph contains 3928 valid entities and 5793 valid relationships. Nodes in the graph represent entities, and edges represent relationships between these entities. Storing the information in a graph database enables quick querying and localization of information. Through visualization, the graph can present information about the incidents, including the type of incident, causes, involved ship information, consequences, and pollution situations. By converting this information into node labels, key elements can be effectively integrated, reducing the reading burden and assisting information retrieval and investigation.
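For illustration, an extracted triple such as (accident)-[cause]->(factor) can be written to Neo4j with Cypher statements of the following shape; the node labels, relationship type, and property values here are hypothetical examples, not the exact schema used in this study.

```cypher
// Create an accident node, a cause node, and the relationship between them
MERGE (a:Accident {name: "Example oil spill", type: "Collision"})
MERGE (c:Cause {name: "Lack of lookout"})
MERGE (a)-[:DIRECT_CAUSE]->(c);

// Retrieve every cause linked to the accident
MATCH (a:Accident {name: "Example oil spill"})-[:DIRECT_CAUSE]->(c:Cause)
RETURN a.name, c.name;
```

Using MERGE rather than CREATE keeps the graph free of duplicate nodes when the same entity appears in multiple accident reports.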

Query Performance Response Time Sensitivity Test
We analyzed the response time required when retrieving results under different restrictions to determine the effectiveness and scalability of the research method, as shown in Table 8. We conducted a sensitivity analysis on the knowledge graph by setting different result limits and observing the number of results returned and the system response time. We found that as the result limit increases, both the number of returned entries and the response time increase.
The results indicate, first, that the retrieval is effective: a large amount of relevant information can be retrieved from the knowledge graph to support the analysis of ship pollution accidents. Second, as the result limit increases, the response time increases accordingly. The experimental data show that query complexity grows with the number of returned results, leading to longer response times; therefore, when dealing with large-scale data, the impact of query performance and response time on the research must be considered. Even when the result limit is set to its maximum value, the response time is only 28 s, which demonstrates the effectiveness and scalability of this method for the analysis of marine pollution accidents.

Accident Case Retrieval
Cypher is the query language of the Neo4j graph database and is used to query and manipulate data in it. MATCH specifies a pattern and queries the nodes and relationships that match it. For example, to query specific pollution accident information, you can enter MATCH (n:MAXIMA) RETURN n LIMIT 1. As shown in Figure 12, the exact information of the pollution accident is displayed, including the type of accident, cause of the accident, time involved, amount of oil spilled, and other related information. In the figure, (*30) and (*29) represent the number of entities, attributes, and relationships of the accident case. During case retrieval, the node expansion function can be used to connect a case directly with other accident information and infer the current trend of such accidents. For example, the following questions can be answered: Which factors account for the highest proportion of the direct causes of pollution accidents? Where are these accidents concentrated? What types of ships cause frequent pollution accidents? By exploring these questions, lessons can be drawn from pollution accident cases, and maritime administration departments can also simplify the accident investigation process through case retrieval.
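Questions like these can be answered directly in Cypher by aggregating over the graph. For example, ranking direct causes by the number of linked accidents might look like the following; the node labels and relationship type are hypothetical, since Figure 12 uses the study's own schema naming.

```cypher
// Rank direct causes by how many accidents they are linked to
MATCH (a:Accident)-[:DIRECT_CAUSE]->(c:Cause)
RETURN c.name AS cause, count(a) AS accidents
ORDER BY accidents DESC
LIMIT 10;
```

Analogous queries grouping on a location or ship-type property would answer the other two questions.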

Accident Cause Analysis
The occurrence of ship pollution accidents involves many factors. In order to prevent accidents effectively, it is necessary to analyze the correlations between these factors. Centrality analysis can be used to evaluate the importance of nodes in the knowledge graph, identify key nodes, and compute the influence degree, affectedness degree, centrality, and causality degree of each node. The influence degree refers to the degree to which a node influences other nodes, and the affectedness degree refers to the degree to which a node is influenced by other nodes. Centrality and the causality degree measure the importance of nodes in the knowledge graph, which helps to analyze the correlations between different accidents and effectively identify and explore the internal logic of ship pollution accidents.
The calculation steps of centrality are as follows:
Step 1: determine the causal factors in the system.
Step 2: determine the influence relationships between the factors and quantify the degree of influence between them, thereby establishing the direct influence matrix A.
Step 3: normalize matrix A to obtain matrix N.
Step 4: calculate the comprehensive influence matrix T between the causal factors in the system. The calculation formula is T = N(I - N)^{-1}, where I is the identity matrix.
Step 5: calculate the influence degree and affectedness degree of each causal factor. The influence degree f_i of causal factor i is the sum of the elements in row i of matrix T, and the affectedness degree e_i is the sum of the elements in column i of matrix T; that is, f_i = Σ_j t_{ij} and e_i = Σ_j t_{ji}.
Step 6: calculate the centrality and causality of each causal factor. The centrality of a factor is the sum of its influence degree and affectedness degree, m_i = f_i + e_i; the causality is the difference between them, u_i = f_i - e_i.
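The steps above can be sketched in a few lines of pure Python. Since T = N(I - N)^{-1} equals the convergent power series N + N^2 + N^3 + ..., the sketch sums that series instead of inverting a matrix; the 3x3 direct influence matrix is a hypothetical toy example, not the study's actual data.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dematel(A, terms=100):
    """Influence, affectedness, centrality and causality of each factor."""
    n = len(A)
    # Step 3: normalize by the largest row/column sum of A.
    s = max(max(sum(row) for row in A),
            max(sum(A[i][j] for i in range(n)) for j in range(n)))
    N = [[a / s for a in row] for row in A]
    # Step 4: T = N(I - N)^{-1} = N + N^2 + N^3 + ...
    T = [row[:] for row in N]
    P = [row[:] for row in N]
    for _ in range(terms):
        P = matmul(P, N)
        T = [[T[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    # Step 5: row sums give influence, column sums give affectedness.
    f = [sum(T[i]) for i in range(n)]
    e = [sum(T[i][j] for i in range(n)) for j in range(n)]
    # Step 6: centrality = f + e, causality = f - e.
    centrality = [fi + ei for fi, ei in zip(f, e)]
    causality = [fi - ei for fi, ei in zip(f, e)]
    return f, e, centrality, causality

# Toy matrix: factor 0 influences factors 1 and 2 but is influenced by nothing.
A = [[0, 2, 3],
     [0, 0, 1],
     [0, 0, 0]]
f, e, centrality, causality = dematel(A)
```

For this toy matrix, factor 0 has affectedness 0 and positive causality, i.e., it acts as a pure "cause" factor, while factor 2 is purely an "effect" factor.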
After completing the calculation steps above, the influence degree, affectedness degree, centrality, and causality of each causal factor can be obtained. As shown in Table 9, this paper identifies 20 important causal factors with centrality > 1. If the centrality of a node is greater than 1, it usually means that the node is very important in the knowledge graph and may be a core node or a key node connecting different parts; such nodes play a key role in information dissemination and diffusion. Analyzing ship accidents according to direct causes, indirect causes, and objective causes helps to understand the causes of accidents fully and deeply and provides strong support for formulating effective prevention and response measures. Direct causes usually refer to specific events or behaviors that directly lead to accidents, while indirect causes are inducements or background factors that give rise to direct causes. Objective causes usually refer to problems caused by the environment.
(1) Direct causes: According to the table, the centrality and causality degrees of "lack of lookout", "failure to make a comprehensive assessment of the situation and collision risk", and "failure to fulfill the obligation of the give-way ship" are ranked at the top, indicating that these three nodes are the most important in the knowledge graph. Their influence degrees are also ranked at the top, indicating that these nodes exert the highest degree of influence on other nodes while being least affected by others. In view of these causes, crew safety training, duty management, and focus at work should be strengthened to reduce accidents.
(2) Indirect causes: According to the table, the centrality and causality degrees of indirect causes such as "not equipped with a sufficient number of qualified crew members", "crew certificates do not meet the grade requirements", and "safety management system fails to operate effectively" are ranked at the top, and their influence values are high, which shows that indirect causes also play a decisive role in the occurrence of pollution accidents. In response to these problems, the management of shipping companies, supervision by law enforcement departments, and crew training should be strengthened to prevent similar accidents from happening again.
(3) Objective causes: According to the table, the affectedness degree of accidents caused by factors such as "poor visibility" and "stormy weather" is 0, but their centrality is greater than 1, indicating that they are the main objective factors leading to pollution accidents. However, their causality degree is less than 0, indicating that they do not dominate a large number of pollution accidents. In view of these objective factors, a careful weather and route analysis should be conducted before a ship sails, and a suitable, safe route should be chosen to minimize accidents caused by objective factors.

Discussion
In studying ship pollution accidents with knowledge graph technology, we have found that this method plays a vital role in improving accident prevention and emergency response capabilities and in strengthening marine environmental protection. Previous studies have shown that accident prevention is an essential purpose of accident investigation (Chen et al., 2019) [4]. Existing research focuses more on analyzing the factors that cause accidents or predicting the risks of future accidents (Sevgili et al., 2022) [9]. Due to the complexity of pollution accidents, especially when human factors are involved, human reliability assessments and simple statistical analysis methods may not provide sufficient information (Heij et al., 2011) [5]. Therefore, it is necessary to propose an effective method to identify and store the key information of ship pollution accidents.
In recent years, much research has been conducted on ship pollution accidents because these accidents can seriously threaten human life and environmental safety. Some studies have analyzed the influencing factors of ship pollution accidents from a macro perspective but fail to capture the specific circumstances of each pollution accident at the micro level (Sathish et al., 2023) [16]. Other studies have used Bayesian networks or machine learning methods to predict ship oil spill accidents (Tang et al., 2016) [11]; although such prediction can provide accurate probabilities, the analysis of accident causes is ignored. Still other studies have applied data analysis methods, such as data mining and data-driven approaches, to explore the causes of pollution accidents (Huang et al., 2023) [15]. However, most cause analyses ignore the joint impact of multiple factors on the occurrence of pollution accidents.
To avoid the shortcomings above, knowledge graph technology can be used to study and analyze ship pollution accidents. A knowledge graph can not only provide specific information about accidents but also display the internal correlations among the relevant accident elements (Gan et al., 2023) [27]. Knowledge graphs constructed around the characteristics of pollution accidents show potential to support the analysis of ship pollution accidents. Previous studies have shown that identifying the main factors of accidents and constructing conceptual models of accidents can help prevent and manage traffic accidents (Wan et al., 2023) [31]. Although pollution accident reports have been successfully applied in research fields such as accident prediction and cause analysis, they still face a series of problems, such as cumbersome and time-consuming manual preprocessing and the inability to clearly display ship pollution accident information.
In the era of big data, it is of great significance to propose a practical and innovative method for extracting critical information from ship pollution accident reports. Using the framework proposed in this study, factual information can be extracted from investigation reports of ship pollution accidents, including the types of pollution accidents, ship parameters, direct causes, indirect causes, objective causes, oil spill quantities, and types of pollutants. In addition, the constructed knowledge graph and causal analysis can support decision-making systems and experience sharing to enhance maritime safety management.

Conclusions
This study proposes a new method for analyzing ship pollution accidents using knowledge graph technology. We collected information on 411 ship pollution accidents in the coastal areas of China from the official website of the China Maritime Safety Administration. These data were obtained from legal and compliant sources and constitute the fundamental data for constructing the knowledge graph.
Firstly, following the general construction process of knowledge graphs, we designed a framework for constructing a knowledge graph of ship pollution accidents and, based on the characteristics of ship pollution accident information, defined the basic concepts of entities and relationships. The Word2vec word vector model was used to vectorize the annotated text, and BERT-BiLSTM-CRF and BiLSTM-CRF were used to extract entities and relationships. We used the Neo4j graph database to store the knowledge extracted from the 411 ship pollution accidents in China and visualized the knowledge graph, generating 3928 valid entities and 5793 valid relationships. The accuracy rates of the entity extraction and relationship extraction models are 79.45% and 82.47%, respectively, indicating good accuracy. Visualization of the knowledge graph clearly presents the various kinds of accident information. Secondly, using the Cypher language for quick retrieval, the extracted text information can be used for analyzing the causes of pollution accidents and for data statistics. Compared with traditional manual or keyword queries, this method extracts information faster and more effectively and provides guidance and support for further causal analysis. Finally, through the centrality algorithm, the correlations between accident causes were explored and targeted measures were proposed. During the research process, in order to improve the validity of the study, we cleaned and standardized the collected data; in addition, we integrated the data sources and conducted expert reviews to prevent, as much as possible, inaccurate data from affecting the effectiveness of the knowledge graph and the centrality algorithm.
At the same time, we integrated knowledge and methods from different fields to gain a more comprehensive understanding of, and to address, issues related to accident prevention and emergency response. By summarizing experiences, lessons learned, and best practices from on-site investigations and case studies, we can provide reference and inspiration for improving accident prevention and emergency response capabilities.

Management Significance
Using knowledge graph technology to construct a knowledge graph of ship pollution accidents and conduct centrality algorithm analysis can help improve managers' understanding of and response capabilities to ship pollution accidents and optimize management decisions and policy formulation, thus protecting the marine environment more effectively and promoting sustainable development. Our knowledge graph has the following management significance.
1. Visualizing the relationships in ship pollution accidents. Knowledge graph visualization can visually display key nodes and relationships in ship pollution accidents, such as ship parameters, pollution types, and oil spills. Management personnel can quickly understand the overall information and development trends of pollution accidents through visual graphs, helping them better handle pollution accidents.
2. Case retrieval and experience sharing. Using the case retrieval function of the knowledge graph, management personnel can easily discover similar historical cases of ship pollution accidents and learn relevant experiences and lessons. This helps managers develop more timely response measures and emergency plans, improving their ability and efficiency in responding to ship pollution accidents.
3. Centrality algorithm analysis. By using centrality algorithms to analyze nodes in the knowledge graph, managers can identify nodes of significant importance and influence, such as ships with frequent pollution accidents or key locations where pollution accidents occur. Managers can focus on monitoring and managing these key nodes to improve monitoring and response capabilities for ship pollution accidents.
4. Decision support and policy making. Based on the visualization and analysis results of the knowledge graph, managers can formulate management policies and response measures for ship pollution accidents more scientifically. By deeply understanding the key factors influencing pollution accidents, managers can formulate targeted policies and measures to improve the prevention and control of ship pollution accidents.
5. Data-driven management decisions. Knowledge graph visualization and analysis provide managers with data-based decision support, enabling more rational and scientific management decisions. Through in-depth analysis of the data in the knowledge graph, managers can adjust and optimize management strategies in a timely manner to improve management efficiency and decision quality.

Limitations and Research Prospects
This study proposes a new method for analyzing ship pollution accidents using knowledge graph technology and demonstrates the application of knowledge graphs in the field of ship pollution accidents. However, this study also has some limitations. First, the data used for knowledge graph construction and model training come from China's coastal areas; pollution accidents in other countries and regions have not been considered. Subsequent research will further collect and supplement data from other areas. Second, this study mainly considers the entity and relationship extraction of ship pollution accidents without considering the impact of meteorological conditions, the marine environment, and other factors on accidents. Future research will further improve the comprehensiveness and applicability of the model.
Future research directions include, but are not limited to, the following aspects. First, pollution accident data can be further supplemented to improve the quality and completeness of the data. Second, pollution warning analysis methods can be explored, combined with real-time data monitoring technology, to enhance accident prediction and early warning capabilities. In addition, we will design experimental and evaluation methods to assess and compare different prevention and response strategies; scientific experimental design and evaluation methods can be used to collect and analyze relevant data to evaluate the effectiveness and feasibility of the research. By strengthening international cooperation and sharing, research institutions, industry organizations, and government departments in other countries and regions can collaborate to jointly carry out research and practice in accident prevention and emergency response. In summary, although this study has certain limitations, it provides important support for the prevention and control of ship pollution and for the judicial handling of pollution accidents. Future research will further improve and expand this method and enhance its application in maritime accident investigation and environmental protection.

Figure 2. Design framework of knowledge graph for marine ship pollution accidents.

Figure 3. Example illustration of pattern layer and data layer construction results.

Figure 10. Loss curves of different models.

Figure 11. Storage of the results with a triple in Neo4j (partial).

Author Contributions: Conceptualization, P.Z.; Methodology, J.H. and G.L.; Software, J.H. and W.Z.; Formal analysis, J.H.; Investigation, P.Z. and G.L.; Data curation, W.Z.; Writing - original draft, J.H. and G.L. All authors have read and agreed to the published version of the manuscript.
Funding: The work was supported in part by the Ningbo International Science and Technology Cooperation Project (2023H020), the Key R&D Program of Zhejiang Province (2024C01180), the National Natural Science Foundation of China (52272334), the EC H2020 Project (690713), and the National Key Research and Development Program of China (2017YFE0194700). We would also like to thank the National "111" Centre on Safety and Intelligent Operation of Sea Bridges (D21013) for the financial support in publishing this paper.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Table 1. Introduction to the Neo4j graph database.

Table 2. Entity and relationship description table.
Figure 3 shows an example diagram of the pattern layer and data layer constructed in VISIO, where solid arrows of different colors represent the relationships between different entities and instantiation is expressed by dotted-arrow connections. Blue lines represent accident cases, yellow lines represent different attributes, orange lines represent relationships, and green lines represent attribute types.

Table 3. Synonym table for rule-matching in the field of ship pollution accidents.

Table 4. Example of rule-matching entity extraction.

Table 7. Experimental results of different models.

Table 8. Response time distribution table.

Table 9. Distribution table of centrality matrix.