4.1. Data Collection and Preprocessing
Accident report data is acquired from the official website of China MSA (
https://www.msa.gov.cn accessed on 24 February 2026). There was a total of 312 ship-collision investigation reports. A total of 90 reports were manually annotated and used in
Section 3.2 (LeBERT entity recognition model enhanced by domain vocabularies); 292 reports (including the 90) were used to construct the ship-collision knowledge-triple dataset for knowledge injection; 20 reports (non-overlapping with the 292) were used in
Section 3.4 (K-BERT-based entity recognition model). All 312 reports were used for knowledge graph construction and graph-structure-based severity classification.
Based on the structural characteristics of ship-collision investigation reports, the data is categorized into semi-structured and unstructured components. For semi-structured data, which primarily involves tabular information regarding vessel and personnel characteristics, an automated extraction framework is implemented using the DeepSeek-V3 LLM accessed via its official API. Specifically, the temperature was set to 0 to enforce greedy decoding, thereby eliminating sampling randomness and reducing the risk of hallucinated content. Both the frequency penalty and presence penalty were set to 0, as information extraction tasks require the precise replication of source terminology without penalizing vocabulary repetition. A “five-step prompt strategy” is designed based on the features of semi-structured text data, as detailed in
Table 5, encompassing task requirement, data description, sample data, scenario information, and standard output. First, the task requirement prompt is utilized to clearly define the extraction objectives, data fields, and attribute values, providing precise instructions to the LLM. Second, the data description prompt explicitly describes the detailed raw text format, key–value relationships, and other metadata, establishing a unified template for entity fields, relationship types, and attribute units to enhance extraction stability. Then, the sample data prompt provides instances of raw semi-structured text using formatted separators to differentiate between entities and attributes, thereby eliminating potential parsing conflicts and ensuring accurate parsing by the LLM. Furthermore, domain-specific prior knowledge is injected through the scenario information prompt to enhance task comprehension, while historical dialogue context is maintained within the API message buffer to allow for the iterative optimization of prompting strategies through contextual feedback. Finally, the standard output prompt explicitly regulates the output format to ensure it meets structured storage requirements for subsequent knowledge graph construction or database integration. By activating the LLM’s recognition of data parsing and conversion rules through this strategy, structured data is automatically generated.
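To make the strategy concrete, the sketch below assembles the five prompt components into an OpenAI-compatible chat request of the kind DeepSeek’s API accepts. The function name, bracketed section labels, and sample field values are illustrative assumptions, not the paper’s exact prompts.

```python
# Sketch: the "five-step prompt strategy" assembled into a deterministic
# chat-completion request (temperature and penalties set to 0, as in the paper).
# Section labels and sample values are illustrative assumptions.

def build_five_step_prompt(task_requirement: str, data_description: str,
                           sample_data: str, scenario_info: str,
                           standard_output: str, raw_text: str) -> dict:
    """Combine the five prompt components plus the raw report text into
    a request body in the OpenAI-compatible format used by DeepSeek."""
    system = "\n\n".join([
        "[Task requirement] " + task_requirement,
        "[Data description] " + data_description,
        "[Sample data] " + sample_data,
        "[Scenario information] " + scenario_info,
        "[Standard output] " + standard_output,
    ])
    return {
        "model": "deepseek-chat",   # DeepSeek-V3 endpoint name (assumed)
        "temperature": 0,            # greedy decoding
        "frequency_penalty": 0,
        "presence_penalty": 0,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": raw_text},
        ],
    }

req = build_five_step_prompt(
    "Extract vessel attributes as entity-relation-entity triples.",
    "Input is semi-structured key-value text taken from PDF tables.",
    "Vessel Name ||| OMEGA",
    "Domain: ship-collision investigation reports (China MSA).",
    "Return CSV rows: entity,relation,entity.",
    "Vessel Name: OMEGA\nIMO Number: 1234567",
)
```

In an iterative session, previous turns would be appended to the `messages` list, matching the paper’s use of the API message buffer for contextual feedback.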
Vessel feature data is utilized as validation data, where attributes such as Vessel Name, Former Name, Vessel Type, Port of Registry, IMO Number, and Call Sign are stored in semi-structured text within PDF files. By applying the five-step prompt strategy to inject instructions and constraints into the LLM, vessel feature extraction and conversion are realized, automatically generating standardized “entity-relation-entity” CSV files. The resulting structured data for the vessel “OMEGA”, following the completion of this extraction task, is presented in
Table 6.
The preprocessing of unstructured text data includes entity and relationship annotation. During standardization and cleaning of the large volume of collected text, punctuation marks were uniformly normalized, English characters were converted to half-width form, and redundant information was removed. The BIO (Beginning, Inside, Outside) annotation scheme was employed to label entity sequences in the text. Specifically, “B” indicates the beginning token of an entity, “I” denotes the subsequent tokens within the same entity, and “O” represents tokens that do not belong to any entity. Based on this strategy, the entities in the water transportation domain, including accident, vessel, vessel feature, vessel dynamics, equipment, personnel, personnel feature, organization, time, location, environment, cause, consequence, laws and regulations, and recommendations, are annotated. After completing the entity annotation, the semantic relationships between entities are further annotated. The RE dataset is formed in the {sentence, entity 1, relationship, entity 2} format. The annotated data are used to train the automatic entity and relation recognition models in the following experiments.
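A minimal sketch of the BIO labelling step, assuming character-level entity spans given in (start, end, type) form; the span format is an illustrative assumption.

```python
# Sketch: convert character-level entity spans into per-character BIO tags.
# Entity type names follow the paper's schema; span format is assumed.

def to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[str]:
    """Emit one BIO tag per character. Each span is (start, end, type)
    with `end` exclusive."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining characters
    return tags

# Toy example: "OMEGA collided" with "OMEGA" labelled as a Vessel entity.
tags = to_bio("OMEGA collided", [(0, 5, "Vessel")])
```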
4.2. Entity Recognition Based on Domain Vocabulary Enhancement
The dataset constructed in this experiment is derived from 90 ship-collision accident reports and contains 8,556 text segments. About 74.6% are under 100 words, 23.8% are between 100 and 200 words, and only a small fraction exceeds 200 words. For the LeBERT experiments, this dataset was split 9:1 into training and test subsets for model development and evaluation.
Table 7 shows the parameter configurations for the various components of the LeBERT-BiLSTM-CRF model, which incorporates the domain vocabulary enhancement mechanism used in the experiment. With forward–backward concatenation, the BiLSTM produces an output whose dimension is two times the hidden size. The base version of the Chinese RoBERTa with Whole Word Masking-Extended (Chinese RoBERTa-wwm-ext), developed by Harbin Institute of Technology and iFLYTEK Research (HFL), is used in this experiment and has approximately 125 million parameters. To improve training efficiency, this pre-trained model was fine-tuned during training using a small learning rate to balance model performance and resource consumption.
Table 8 lists the hyperparameter settings used for training the LeBERT-BiLSTM-CRF model. Training used the Adam optimizer and minimized the CRF negative log-likelihood (NLL) as a sequence-level objective over the gold label sequences. Gradients were updated by backpropagation. The maximum sequence length was 512; the training/validation batch sizes were 32/16; the learning rates were 3 × 10⁻⁵ for the encoder and 3 × 10⁻³ for the CRF layer.
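The two learning rates are typically realized in PyTorch as optimizer parameter groups. The grouping logic is sketched below framework-free, keyed on parameter names; the names and the `crf.` prefix are illustrative assumptions.

```python
# Sketch: split model parameters into two optimizer groups with different
# learning rates (3e-5 for the encoder, 3e-3 for the CRF layer), mirroring
# torch.optim's [{'params': ..., 'lr': ...}] parameter-group format.
# Parameter names are illustrative assumptions.

def make_param_groups(named_params: dict, encoder_lr: float = 3e-5,
                      crf_lr: float = 3e-3) -> list[dict]:
    groups = [{"params": [], "lr": encoder_lr},   # LeBERT/BiLSTM encoder
              {"params": [], "lr": crf_lr}]       # CRF transition parameters
    for name, p in named_params.items():
        groups[1 if name.startswith("crf.") else 0]["params"].append(p)
    return groups

# Toy stand-ins for model.named_parameters()
params = {"bert.embeddings.weight": 1, "bilstm.weight_ih": 2,
          "crf.transitions": 3}
groups = make_param_groups(params)
```

With a real model, `groups` would be passed directly to `torch.optim.Adam(groups)`.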
The NER experiments were run under the Windows 11 operating system, with the PyTorch 2.1.0 deep learning framework and CUDA 12.4 for model construction and training. In order to verify the effectiveness of the proposed LeBERT-BiLSTM-CRF model enhanced by domain vocabulary information on ship collision NER, several typical models are selected for comparison experiments, including the classical CRF model, BiLSTM combined with the CRF model and pre-trained language models with multiple variants of the BiLSTM-CRF combination. All the models are trained and tested based on the ship collision accident dataset, and the experimental results are shown in
Table 9. It can be seen that the precision, recall and F1-score of the CRF model are 68.1%, 70.1% and 69.4%, respectively. The results show that solely relying on the sequence annotation mechanism makes it difficult to effectively capture the complex entity relationships and contextual semantic information in ship collision scenarios. After introducing the BiLSTM structure on top of the CRF, the precision of the model slightly improves to 69.7%. However, the recall decreases significantly to 63.0%, reducing the overall F1-score to 66.1%. Although the BiLSTM structure increases the complexity and parameter count of the model, it fails to effectively improve the model’s generalization in capturing domain features of ship collision accidents. After introducing the pre-trained language model BERT on top of the CRF, the model performance improves significantly, with precision, recall and F1-score reaching 72.8%, 72.1% and 71.9%, respectively, an improvement of about 3–6% over the CRF and BiLSTM-CRF models. This result reflects the clear advantages of the BERT pre-trained model in feature extraction and semantic understanding.
BERT-BiLSTM-CRF leverages BERT to deeply model linguistic features and contextual dependencies, and BiLSTM to better capture long-distance dependencies and sequential context, which effectively improves entity-boundary recognition and the overall performance of the model, with precision, recall, and F1-score reaching 80.7%, 79.4% and 80.1%, respectively. Furthermore, Chinese-BERT-WWM and RoBERTa were selected for comparison with BERT. The Chinese-BERT-WWM-BiLSTM-CRF model achieves 85.2%, 85.8% and 85.5% in precision, recall and F1-score, respectively, while the RoBERTa-BiLSTM-CRF model performs even better, exceeding 85.6% on every metric. This result indicates that pre-trained models with more refined masking strategies and training methods perform better on ship collision domain data. The LeBERT-BiLSTM-CRF model introduces a specially designed Lexicon Adapter structure to effectively fuse lexical information with character features. This model achieves the best performance, with precision, recall and F1-score reaching 86.3%, 87.5% and 86.8%, respectively, significantly better than the other models. This demonstrates that the incorporation of lexical information plays a key role in improving entity recognition performance in the field of ship collision accidents.
In order to more comprehensively evaluate the recognition performance of the LeBERT-BiLSTM-CRF model in the ship collision accident dataset, this paper further compares and analyzes the recognition accuracy of each model on different entity categories, as shown in
Table 10. The models relying only on CRF or BiLSTM-CRF for sequence annotation perform differently across entity types. The F1-score for “Vessel” is 89.7%, while the F1-scores for categories with sparse data or abstract semantics, such as “Environment” and “Recommendation”, are as low as 20–40%. This suggests that the basic models are unable to fully learn feature representations when dealing with data-scarce, semantically ambiguous, or context-dependent entities. With the introduction of pre-trained language models in BERT-CRF and BERT-BiLSTM-CRF, the overall recognition performance is significantly improved owing to richer semantic and representational capabilities. Especially on high-frequency categories such as “Vessel”, “Personnel”, and “Vessel Feature”, the F1-score improves to over 80%. Meanwhile, the recognition performance on some low-frequency categories, such as “Environment” and “Agency”, also improves, with F1-scores of about 50–70%. With the introduction of stronger pre-trained models such as Chinese-BERT-WWM and RoBERTa, model performance continues to improve. Especially on low-frequency entity categories such as “Recommendation” and “Event Cause”, the F1-score increases significantly to 44–75%, indicating that even with relatively scarce training samples, the models can still effectively capture semantic features and improve recognition accuracy. By introducing domain vocabulary into the RoBERTa pre-trained language model, recognition of low-frequency entity classes improves significantly. The F1-score for the least frequent entities, in the “Recommendation” category, is 57.5%, which is much higher than that of the other models. Likewise, for “Cause”, “Consequence”, “Equipment”, and “Environment”, the corresponding F1-scores increase significantly to more than 75%.
This result suggests that by incorporating external lexical knowledge enhancement strategies, the model was able to capture the boundaries of named entities more accurately, resulting in improved recognition of low-frequency categories and improved overall performance.
The ablation experiment aims to evaluate the contribution of individual modules to the overall performance of the model. In order to verify the actual effect of the pre-trained language module and lexical enhancement on model performance, ablation experiments are conducted with BERT-BiLSTM-CRF, RoBERTa-BiLSTM-CRF, and LeBERT-BiLSTM-CRF (using general-domain vocabulary information), as shown in
Table 11. BERT-BiLSTM-CRF, as a baseline model, used BERT-base-Chinese to generate dynamic word vectors at the embedding layer and achieved 80.7% precision, 79.4% recall, and an 80.1% F1-score, which verifies the efficacy of the BERT module in contextual semantic modelling. When the BERT module is replaced with RoBERTa in RoBERTa-BiLSTM-CRF, semantic modelling is significantly enhanced, with the F1-score improved from 80.1% to 85.6%. This may be related to RoBERTa’s whole-word masking strategy and larger-scale training corpus. The LeBERT-BiLSTM-CRF model is built on RoBERTa-BiLSTM-CRF. The results show that the F1-score improves from 85.6% to 86.3%, indicating that the lexical enhancement strategy compensates, to a certain extent, for the deficiency of relying only on character representations and can better determine entity boundaries and improve entity recognition performance. After replacing the generic vocabulary information with the ship collision accident domain vocabulary, performance improves to 87.5% recall and an 86.8% F1-score. This result shows that introducing domain vocabulary enhances the recognition of entities with fuzzy boundaries in ship collision accident texts, provides the model with more targeted semantic features to expand coverage of domain entities, and strengthens its capability to capture low-frequency and proprietary entities. The ablation experiments reveal that the NER method fusing vocabulary information has strong practicality for fine-grained entity recognition of ship collision accidents.
To further verify the robustness and generalization capability of the LeBERT-BiLSTM-CRF model incorporating domain vocabulary information in the NER task, a 5-fold cross-validation method was employed for assessment. To ensure the consistency and comparability of the evaluation, the model architecture and hyperparameter settings were maintained identically to the configurations specified in
Table 7 and
Table 8.
Table 12 presents the specific performance metrics across five independent experiments. The results indicate highly consistent performance under different data partitions, with mean precision, recall, and F1-score reaching 86.28%, 87.13%, and 86.70%, respectively. Notably, the standard deviation of the F1-score is as low as 0.78 percentage points; such minimal fluctuation demonstrates the strong robustness of the model. Furthermore, the highest F1-score of 87.63% was achieved in Fold 4, showcasing the model’s peak performance.
Table 13 details the classification performance for 15 distinct entity types. Under the rigorous testing of 5-fold validation, the model maintained high recognition accuracy for categories such as “Vessel” and “Laws and Regulations”. Even for sparse categories like “Recommendation” and “Cause,” the average performance remained robust due to the injection of domain-specific lexical information. In conclusion, the 5-fold cross-validation results not only confirm the outstanding robustness of the LeBERT-BiLSTM-CRF model but also prove its reliable generalization capability through sustained high-level performance across multiple unseen data subsets.
Beyond the empirical robustness demonstrated by the cross-validation, it is essential to clarify the minimum data requirement for this extraction pipeline. When fine-tuning high-parameter architectures such as RoBERTa and LeBERT, the corpus scale must be evaluated at the sentence level rather than the document level. As outlined in
Section 4.1, the manually annotated dataset used for this specific task was constructed from 90 representative collision reports, yielding a total of 8556 sentence-level text segments. The existing literature on domain-specific NER indicates that leveraging pre-trained language models substantially reduces the dependency on massive annotated datasets. Empirical evidence demonstrates that a high-quality, domain-specific corpus of a few thousand sentences is typically sufficient to effectively fine-tune these models and reach a performance plateau [
74]. Therefore, our dataset of 8556 text segments comfortably exceeds this functional threshold.
Furthermore, statistical analysis reveals that approximately 74.6% of the extracted text segments contain fewer than 100 characters, and 23.8% range between 100 and 200 characters. This predominance of short text segments aligns well with the 512-token sequence limit inherited from the RoBERTa architecture [
75,
76]. More importantly, confining the input segments within this optimal processing window effectively prevents context dilution, frequently described as the “lost in the middle” phenomenon. This is a prevalent issue where the attention mechanism’s efficacy severely degrades when processing overly long documents [
77]. Coupled with the explicit injection of domain lexicons acting as a strong inductive bias, this specific data scale and length distribution collectively ensure rapid convergence and mitigate the risk of overfitting in the proposed pipeline.
4.3. Analysis of Relationship Extraction
Using the same 90 reports as the domain-vocabulary-enhanced LeBERT-BiLSTM-CRF NER experiments, the RE dataset comprises 11,086 labeled triples, split 8:2 into 8866 training and 2220 validation instances. The BERT-MLP_rule model is adopted for RE in ship collision accidents. The configuration of the BERT-MLP_rule model used in this article is detailed in
Table 14. The pre-trained language model used was the base version of Chinese-RoBERTa-wwm-ext. The input to the multilayer perceptron module consisted of the concatenation of the entire sentence context vector output by BERT and the embedding representations of the two entities, resulting in an input dimension three times that of the BERT hidden layer. The MLP architecture included a single hidden layer with 128 neurons. Given the 38 relation types involved in the experiment, the number of nodes in the model’s output layer was set to 38.
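The dimensionality of this classification head can be sketched with random placeholder weights; the layer sizes follow the configuration above, while the ReLU activation is an assumption for illustration.

```python
import numpy as np

# Sketch of the BERT-MLP_rule head: the sentence context vector and the two
# entity embeddings (each of BERT hidden size 768) are concatenated into a
# 3*768-dimensional input, passed through one 128-unit hidden layer, and
# projected to the 38 relation types. Weights are random placeholders.

HIDDEN, N_REL = 768, 38
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3 * HIDDEN, 128)) * 0.01   # input -> hidden
W2 = rng.standard_normal((128, N_REL)) * 0.01        # hidden -> 38 relations

def relation_logits(cls_vec, ent1_vec, ent2_vec):
    x = np.concatenate([cls_vec, ent1_vec, ent2_vec])  # (2304,)
    h = np.maximum(x @ W1, 0.0)                        # ReLU hidden layer
    return h @ W2                                      # (38,) relation logits

logits = relation_logits(*(rng.standard_normal(HIDDEN) for _ in range(3)))
```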
The hyperparameter configuration is shown in
Table 15. The text length and entity length hyperparameters are used to standardize the data input. The Chinese RoBERTa-wwm-ext is employed to reduce training cost and improve convergence efficiency. During training, the model loss is calculated using the cross-entropy function, the MLP layer weights are updated by backpropagation, and the parameters are optimized using the Adam optimizer.
The performance of the model in recognizing each category of relationships is shown in
Figure 8. For categories with more sufficient training data, the model shows better performance, such as “of_PersonFeature” and “at_Time”, each with an F1-score of 0.98. For categories with fewer samples, it still performs relatively well, such as “rescue” and “occur”, with F1-scores of 0.75 and 0.73, respectively. This indicates that the model reliably recognizes categories with sufficient training data. Notably, even for the small-sample relationships “manipulate_of_NavigationStatus”, “on_of_EngineStatus”, and “at_of_EngineStatus”, the recognition accuracies remain high, at 98.5%, 98.4% and 98.1%, respectively. This may be because their semantic and contextual features are more distinctive, enabling the model to capture these relationships accurately even with limited data. The Chinese RoBERTa-wwm-ext model has strong contextual semantic capture capability, which helps RE on sparse data. In short, the model is able to efficiently identify and extract large-scale domain entity relationships in ship collision accidents.
To ensure the prediction quality and authenticity in the RE task, a quality control mechanism based on the correlation analysis between performance metrics and confidence scores was established. As illustrated in
Figure 9, the reliability of the large-scale extraction task is quantitatively evaluated by benchmarking the F1-score against the average confidence on the validation set. The experimental data reveal that as the training converges, the F1-score stabilizes above 0.93, while the peak average confidence reaches 0.987. This strong positive correlation demonstrates that the model possesses excellent self-calibration capability, and its confidence scores serve as a reliable benchmark for prediction veracity. Based on this validation, by utilizing high-confidence thresholds as an automated verification criterion, the quality of the subsequent knowledge graph construction can be effectively ensured, providing quantitative evidence for the authenticity of the knowledge graph.
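One way such a confidence threshold could serve as an automated verification criterion is sketched below; the threshold value, relation names, and tuple format are illustrative assumptions.

```python
# Sketch of the confidence-based quality gate: extracted triples whose
# predicted-class probability falls below a threshold are routed to manual
# review rather than written into the knowledge graph.

def gate_triples(triples_with_conf, threshold=0.95):
    """Split (head, relation, tail, confidence) tuples into accepted
    and to-review lists based on the confidence threshold."""
    accepted, review = [], []
    for h, r, t, conf in triples_with_conf:
        (accepted if conf >= threshold else review).append((h, r, t))
    return accepted, review

accepted, review = gate_triples([
    ("OMEGA", "of_VesselFeature", "IMO 1234567", 0.987),
    ("OMEGA", "occur", "collision", 0.62),
])
```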
4.4. Entity Recognition Based on K-BERT-BiLSTM-CRF
Knowledge injection uses a domain triple set extracted from 292 ship-collision investigation reports, including the 90 reports used to train the domain-vocabulary-enhanced LeBERT-BiLSTM-CRF model. Entities are recognized by the domain-vocabulary-enhanced LeBERT-BiLSTM-CRF model, and relations by the BERT-MLP_rule model. The 20-report dataset used to train/evaluate the K-BERT model is disjoint from this triple pool.
Table 16 shows the parameter configurations for each module of the K-BERT-BiLSTM-CRF model used in the actual experiment. Because the BiLSTM module concatenates the outputs of the forward and backward LSTM models, the hidden layer dimension of the module output is twice that of a single LSTM hidden layer. The base version of the Chinese RoBERTa-wwm-ext pre-trained model used in this experiment has approximately 125 million parameters. To improve training efficiency, the RoBERTa-wwm-ext pre-trained model was fine-tuned during training using a small learning rate to balance model performance and resource consumption.
Hyperparameter settings for the K-BERT-based entity recognition model are shown in
Table 17. The Adam optimizer was chosen to update the model parameters during the training process, and NLL was used as the loss function to measure the difference between the model output and the true labels. The training process was set with a maximum sequence length of 512, a training batch size of 16, a validation batch size of 8, a learning rate of 1 × 10⁻⁵ for the BERT layer, a learning rate of 1 × 10⁻³ for the CRF layer, and 100 epochs.
In order to verify the effectiveness of the proposed K-BERT-BiLSTM-CRF model, it is compared with BERT-BiLSTM-CRF, RoBERTa-BiLSTM-CRF, and LeBERT-BiLSTM-CRF. RoBERTa-BiLSTM-CRF improves on the BERT-BiLSTM-CRF architecture by replacing the Chinese-BERT-Base module with the Chinese RoBERTa with Whole Word Masking-Extended (Chinese RoBERTa-wwm-ext) as the pre-trained language model. LeBERT-BiLSTM-CRF introduces domain vocabulary information. The K-BERT-BiLSTM-CRF model takes the self-constructed ship collision accident knowledge graph as external knowledge and injects it into BERT for training. As shown in
Table 18, the K-BERT-BiLSTM-CRF model achieves the best performance, with precision, recall, and F1-score of 84.5%, 84.4%, and 84.7%, respectively. This indicates the effectiveness of introducing the domain knowledge graph for improving NER performance. BERT-BiLSTM-CRF, without a domain enhancement mechanism, has a relatively low F1-score of only 78.0%. By using Chinese RoBERTa-wwm-ext instead of BERT-base-Chinese as the encoder, the model performance is significantly improved, with an F1-score of 81.0%, suggesting that a stronger pre-trained language model helps improve performance. LeBERT-BiLSTM-CRF improves the F1-score to 83.5%, indicating that semantic enhancement has a positive impact on NER. The proposed K-BERT-BiLSTM-CRF model is better at recognizing domain terms and complex entities thanks to the injected domain knowledge of ship collision accidents. It not only retains the semantic modelling capability of Chinese RoBERTa-wwm-ext but also uses the knowledge triples injected by K-BERT to support context modelling. K-BERT-BiLSTM-CRF thus has significant advantages for NER in the field of water transportation.
4.5. Knowledge Graph of Ship Collision Accidents
For knowledge-graph construction, a validated extraction pipeline was adopted: LeBERT-BiLSTM-CRF (domain-vocabulary enhanced) for NER and Chinese RoBERTa-wwm-ext-MLP for RE, applied to all 312 ship-collision investigation reports. The resulting ship collision prevention and control knowledge graph contains 35,000 entities and 320,000 relationships. The distributions of entities and relationships of the ship collision prevention and control knowledge graph are shown in
Figure 10 and
Figure 11, respectively. In addition to common entity types such as time, personnel and location, the graph integrates entity types specific to ship collision incident knowledge, including ship dynamics, environment, and recommendations. This provides a comprehensive description of the evolution of ship collision incidents.
To further validate the effectiveness of the constructed knowledge graph, this paper compares the number of entities and relationships with other knowledge graphs in the water transportation domain, as shown in
Table 19. The constructed knowledge graph for ship collision prevention and control surpasses the existing knowledge graphs in the water transportation domain in terms of entity and relation types, entity and relation volume, and the data types considered. It supports both semi-structured and unstructured data, covering 38 relation types and 15 entity types, with the highest number of entities and relations. Through this fine-grained division of entity and relationship types, it can facilitate applications in different scenarios, for instance, association queries and analyses spanning the subject–space–time–behavior-driven accident evolution process, ship activity, and accident causation.
Knowledge graphs usually represent entities and their semantic relationships in the form of triples, which have good expressive capability for characterizing structured knowledge. However, the triple representation may lead to increased computational complexity in large-scale applications, limiting the operational efficiency of graphs. To address this problem, knowledge graph embedding techniques are used to map graph data into a real vector space, making it easier for intelligent algorithms (e.g., machine learning and deep learning) to utilize the information hidden in the graph data. This paper uses the t-distributed Stochastic Neighbor Embedding (t-SNE) method to reduce the dimension of the graph embeddings from 768 to a two-dimensional plane. The K-means clustering method is used to analyze and visualize the embedded graph nodes, as shown in
Figure 12. The clustering results align with the 15 types of entities in the ship collision prevention and control knowledge graph, indicating that the knowledge representation effectively encodes the semantic concepts of knowledge entities related to ship collision incidents. Furthermore, nodes such as “Accident,” “Vessel,” and “Vessel feature” formed distinct clusters, indicating that these nodes are similar in high-dimensional embedding space and maintain this similarity when reduced to two-dimensional space.
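The embed-reduce-cluster analysis can be illustrated with a toy K-means run on synthetic 2-D points standing in for the t-SNE-reduced node embeddings; the real pipeline clusters 768-D embeddings after t-SNE, and this sketch uses a deterministic initialization for reproducibility.

```python
import numpy as np

# Sketch: minimal K-means on 2-D points that stand in for the
# t-SNE-reduced knowledge-graph node embeddings. Data are synthetic.

def kmeans(points, k, iters=20):
    # deterministic init for the sketch: evenly spaced sample points
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        dists = ((points[:, None] - centers) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # recompute centers as cluster means
        centers = np.stack([points[labels == j].mean(0) for j in range(k)])
    return labels

# Two well-separated synthetic "entity-type" clusters of 20 points each.
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
labels = kmeans(pts, k=2)
```

With real data, the same call would run on the t-SNE output, and the resulting labels would be compared against the 15 entity types.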
The shortest path between the two ships involved in the accident can demonstrate the comprehensive ship collision accident knowledge contained in the constructed knowledge graph. The shortest-path query takes the following form (Cypher syntax):
MATCH p = shortestPath((a)-[*..n]-(b)) RETURN p
where p represents the matched query path, [*..n] constrains the path length to at most n, nodes a and b usually represent two different entities, and RETURN p returns the matched query path information. Taking the “Ansheng 22” ship and “Minshiyu 06256” ship collision accident as an example, the shortest path in
Figure 13 shows related entities such as the involved vessels, the companies of the involved vessels, vessel dynamics, accident locations, personnel on board, vessel equipment, accident causes, consequences of the accident, violated laws and regulations, and recommendations, as well as their relationships. This allows for a comprehensive association analysis of the entire process of a vessel collision incident.
Maritime supervisors can quickly obtain the overall overview of ship collision accidents through the shortest path query. Based on the subject–space–time–behavior analysis of the spatial and temporal process of the accident,
Figure 14 shows the ship dynamics of the ZTE 2 (ship name) during the collision accident and can support the ship collision situation awareness combined with the relevant AIS information. As shown in
Figure 15, the knowledge graph also supports queries of ship inspection activities, so that maritime supervisors can check whether a ship has violated mandatory inspection requirements. The ship collision prevention and control knowledge graph, combined with NLP technology, can improve the accuracy of NER in the field of water traffic accidents, support more efficient analysis of water traffic accidents, and provide knowledge support for intelligent transportation decision-making.
4.6. Classification of Accidents Based on the Constructed Knowledge Graph
Accident severity can be determined according to the relevant standards of the Maritime Safety Administration of the Ministry of Transport of China. The classification criteria are mainly based on factors such as injuries and deaths, economic losses, and the degree of environmental pollution caused by the accident. Although current maritime regulations clearly define accident severity levels, the existing classification mechanism relies mainly on manually recorded accident consequences, making it difficult to satisfy data-driven automatic identification of accident severity levels in practical application scenarios such as emergency response, risk warning, and supervisory assistance. A classification model based on a knowledge graph can quickly complete NER upon receiving the initial accident text and then realize intelligent identification of the accident severity level with good real-time performance, accuracy, and scalability.
The LSTM deep learning classification model is employed to predict the severity of injuries in ship collisions based on a knowledge graph. By introducing the topological information of the knowledge graph, the model is able to capture the complex relationship between the accident features so as to improve the accuracy. The topological features include the number of nodes, the sum of in-degrees, the sum of out-degrees, betweenness centrality, and closeness centrality. The Z-score normalization method was employed to address the differences in numerical magnitude between different topological features. To meet the model input requirements, the accident severity category variable was encoded using one-hot encoding with 0 (minor accident), 1 (ordinary accident), 2 (relatively serious accident), and 3 (serious accident). The Synthetic Minority Oversampling Technique (SMOTE) is an oversampling method designed to address class imbalance issues. It generates new synthetic samples by interpolating between existing minority class samples, thereby enhancing the learning effectiveness and classification performance of the classification model. The experiment is completed on a Core i9 CPU system equipped with 32GB RAM, and the stochastic gradient descent optimization algorithm is used for parameter updating. The hyperparameters are shown in
Table 20.
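The feature preprocessing described above can be sketched as follows; the topological feature values are illustrative, and SMOTE oversampling (e.g., via the imblearn package) would be applied afterwards and is omitted here.

```python
import numpy as np

# Sketch: Z-score normalization of the graph-topology features and one-hot
# encoding of the four severity levels. Feature values are illustrative.

def zscore(X):
    """Normalize each feature column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def one_hot(labels, n_classes=4):
    """Encode integer severity labels 0-3 as one-hot rows."""
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

# Columns: node count, in-degree sum, out-degree sum,
# betweenness centrality, closeness centrality (toy values).
X = np.array([[12, 30, 28, 0.10, 0.45],
              [40, 95, 90, 0.32, 0.61],
              [25, 60, 55, 0.21, 0.52]], dtype=float)
Xn = zscore(X)
y = one_hot([0, 3, 2])   # minor, serious, relatively serious
```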
This paper uses LSTM-RNN to classify the water transportation accident data and compare it with MLP, Random Forest (RF), and XGBoost models.
Figure 16 shows the loss changes in the LSTM-RNN model during the training process. After 200 rounds of iterative training, the LSTM-RNN model already had a good performance. In the first 50 rounds of iterative training, the loss decreased slowly. The loss values decreased rapidly in the 50–125 rounds of iterative training, gradually slowed down and eventually stabilized after about 125 rounds of iterative training.
Figure 17 shows the trend in the accuracy of the LSTM-RNN model over the 200 training iterations. During the first 50 iterations, accuracy increased slowly; between iterations 50 and 125, it rose rapidly; after approximately 125 iterations, the improvement on both the training and test sets gradually slowed and eventually stabilized.
Figure 18,
Figure 19,
Figure 20 and
Figure 21 show the confusion matrices of the LSTM-RNN, MLP, XGBoost, and RF models on the test set. The LSTM-RNN model demonstrated the best overall performance in recognizing the four accident types. Its prediction accuracy for ordinary accidents was the highest, with 41 samples correctly identified and only 3 misclassified into other categories. For relatively serious accidents, 23 samples were correctly predicted, with only 1 misclassified. Of the 18 serious accident samples, 16 were correctly classified, with only 2 misclassified into other categories. Additionally, 16 minor accidents were accurately classified. The XGBoost model performed well in identifying relatively serious and serious accidents: its confusion matrix shows that it correctly recognized all 24 relatively serious accident samples and made 17 correct predictions for serious accidents, with only 1 misclassification. However, its performance was slightly weaker on minor and ordinary accidents, with 4 and 6 misclassifications, respectively. The random forest model also achieved 100% accuracy in identifying relatively serious accidents, but its performance on ordinary accidents was slightly inferior to that of the LSTM-RNN (35 correct, 9 misclassified). For minor accidents, 15 samples were correctly classified. Notably, the RF model misclassified 8 serious accidents as relatively serious, indicating a relatively weak ability to distinguish between adjacent severity levels. The MLP model performed worst among the four, especially in identifying ordinary and serious accidents, where its accuracy was relatively low: only 29 ordinary accidents were correctly predicted, with as many as 15 misclassified, and for serious accidents, 15 were correctly identified and 3 misclassified.
While the MLP model achieved some effectiveness in recognizing relatively serious accidents (21 samples) and minor accidents (16 samples), its overall stability was insufficient, and it struggled to classify multi-category accident data accurately.
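For reference, a confusion matrix of the kind read off in the figures above can be computed as follows; the labels here are illustrative, not the actual test-set predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=4):
    """Rows are true severity classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Illustrative labels (0 = minor, 1 = ordinary, 2 = relatively serious, 3 = serious).
y_true = [1, 1, 1, 2, 3, 0, 3]
y_pred = [1, 1, 2, 2, 3, 0, 1]
cm = confusion_matrix(y_true, y_pred)
# Diagonal entries count correct predictions; off-diagonal entries count
# samples misclassified into other severity levels.
```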
In order to further evaluate the comprehensive performance of different models in water traffic accident severity classification, this paper introduces four indicators, namely, accuracy, precision, recall and F1-score, to quantitatively analyze the prediction performance, as shown in
Table 21. Except for the LSTM-RNN, none of the models exceeds 90% overall accuracy in accident severity classification. The LSTM-RNN model performs best, with an accuracy of 92.31%, significantly higher than the other models, whereas the MLP model performs weakly at 77.08%; the XGBoost and RF models reach an intermediate level between 80% and 90%. In terms of precision, the LSTM-RNN model again leads at 91.84%, indicating high reliability in its predictions and an ability to effectively reduce false alarms. XGBoost and RF achieve 89.12% and 82.40%, respectively, while the MLP model reaches only 79.00%. As for recall, the LSTM-RNN still leads at 91.70%, followed by XGBoost (89.65%), while the RF and MLP models stand at 79.61% and 81.84%, respectively, suggesting that both are somewhat deficient in recognition completeness. The F1-score, as the harmonic mean of precision and recall, is an important indicator of a model's comprehensive performance. The LSTM-RNN model's F1-score reaches 91.73%, well above the other models, indicating that it achieves a good balance between accuracy and generalization ability and possesses stronger predictive capability.
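The four indicators in Table 21 can be reproduced from predicted labels as in the following sketch; macro averaging over the four classes is an assumption here, since the paper reports single precision, recall, and F1 values per model:

```python
import numpy as np

def macro_scores(y_true, y_pred, n_classes=4):
    """Return accuracy plus macro-averaged precision, recall, and F1-score."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float(np.mean(y_true == y_pred))
    prec, rec, f1 = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0  # precision for class c
        r = tp / (tp + fn) if tp + fn else 0.0  # recall for class c
        prec.append(p)
        rec.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)  # harmonic mean
    return acc, float(np.mean(prec)), float(np.mean(rec)), float(np.mean(f1))
```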
According to the model performance curves shown in
Figure 22 and
Figure 23, a more in-depth analysis can be conducted of the discriminative capability and accuracy of the different models. From the macro-averaged ROC curves, both the LSTM-RNN and XGBoost models achieve an AUC of 0.98, demonstrating the strongest classification ability; these two models attain high true positive rates and low false positive rates when classifying waterway traffic accident severity. The RF model yields an AUC of 0.97, slightly lower but still excellent. By comparison, the MLP model's AUC of 0.95 is lower than that of the other models, suggesting a relatively weaker capability to distinguish between categories. The macro-averaged PR curves further show that LSTM-RNN and XGBoost also lead with a PR-AUC of 0.94, indicating that both maintain high prediction precision while preserving recall, which suits real-world tasks demanding accurate recognition. The PR-AUC of RF is 0.90, with good overall stability, while that of the MLP model is only 0.86, showing weaker performance on imbalanced data and a greater tendency toward missed detections or misjudgments. Based on both the ROC and PR metrics, the LSTM-RNN and XGBoost models exhibit superior overall discriminative capability, predictive accuracy, and stability.
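A macro-averaged one-vs-rest ROC AUC of the kind reported above can be computed from class-probability outputs. The sketch below uses the rank-sum (Mann–Whitney) formulation of AUC; the equal-weight averaging over classes is an assumption, as the paper does not state its averaging scheme:

```python
import numpy as np

def binary_auc(y_true, scores):
    """ROC AUC for one binary problem via the rank-sum statistic."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    correct = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return correct / (len(pos) * len(neg))

def macro_roc_auc(y_true, proba, n_classes=4):
    """One-vs-rest AUC per severity class, averaged with equal class weight."""
    y_true = np.asarray(y_true)
    aucs = [binary_auc((y_true == c).astype(int), proba[:, c])
            for c in range(n_classes)]
    return float(np.mean(aucs))
```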
Considering the results from the confusion matrices, accuracy, precision, recall, F1-score, and the ROC and PR curves, the graph-feature-driven LSTM-RNN model leverages its strong sequence-modelling capability and robustness for multi-class, imbalanced accident severity classification. Compared with traditional methods that rely solely on text vectors or statistical features for accident severity prediction, the ship collision prevention and control knowledge graph approach extracts entities and relationships from accident reports and forms a semantic structure [
67,
84]. The knowledge graph has complex network topology features. To evaluate the effect of incorporating knowledge graph topological features on accident severity prediction, this study systematically compares the performance of the LSTM-RNN model under various feature combinations. As shown in
Table 22, the experimental results indicate that, using only the Nodes characteristic as input, the model achieves an accuracy of 83.65% and an F1-score of 84.63%. The incremental introduction of individual topological components led to consistent performance gains: combining the Nodes characteristic with Betweenness centrality, Degree centrality, and Closeness centrality improved the F1-score to 87.66%, 89.56%, and 89.65%, respectively. Notably, Closeness centrality provided the largest boost to accuracy, reaching 90.38%, suggesting that the global proximity of entities within the knowledge graph is a critical factor in discriminating accident severity levels. Finally, when the Nodes characteristic and the full set of topological features are jointly input, the accuracy peaks at 92.31% and the F1-score improves to 91.73%. This validates the significant advantage of the knowledge graph's topological structure and demonstrates a clear synergistic effect among the topological dimensions in enhancing accident severity classification performance.
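As a sketch of how the Nodes characteristic and topological features could be extracted from an accident knowledge graph, the following uses the networkx library on a toy directed graph with hypothetical entity names; the per-graph aggregation (mean centrality over nodes) is an assumption, as the paper does not specify how node-level centralities are pooled:

```python
import networkx as nx

def graph_features(edges):
    """Return the five topological features used as classifier input:
    node count, in-degree sum, out-degree sum, mean betweenness
    centrality, and mean closeness centrality."""
    G = nx.DiGraph(edges)
    n = G.number_of_nodes()
    in_sum = sum(d for _, d in G.in_degree())
    out_sum = sum(d for _, d in G.out_degree())
    betw = sum(nx.betweenness_centrality(G).values()) / n
    clos = sum(nx.closeness_centrality(G).values()) / n
    return [n, in_sum, out_sum, betw, clos]

# Toy accident graph; entity names are illustrative, not from the dataset.
feats = graph_features([("VesselA", "Collision"),
                        ("VesselB", "Collision"),
                        ("Collision", "HullDamage")])
```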
Although the optimized model achieved high accuracy on the test set, 6 samples were misclassified. To investigate the underlying causes of these errors, two representative misclassified cases (Case 74 and Case 91) were selected from the test set for qualitative analysis. This analysis is strictly based on the core features extracted from the knowledge graphs, namely the Nodes characteristic (representing the scale of entity nodes) and the Topological features (including degree centrality, betweenness centrality, and closeness centrality).
The analysis in
Table 23 shows that the primary source of error is the nonlinear mapping bias between the graph’s structural representation (Nodes characteristic and topological features) and the actual accident severity. Overestimation bias occurs when a “Relatively serious” accident has a complex narrative structure, leading to an inflated Nodes characteristic and betweenness centrality. This structural “busyness” misleads the model into predicting a higher severity level. Conversely, underestimation bias occurs when a “Serious” accident involves few entities and a simple causal chain, resulting in a restricted Nodes characteristic along with low degree centrality and closeness centrality. This structural “simplicity” masks the true severity of the accident’s consequences.
These rare edge cases highlight the inherent limits of relying exclusively on pure topological structure for classification. However, even without any additional semantic weighting, the current LSTM-RNN model classified the vast majority (over 91%) of real-world test samples correctly. This robust performance validates that decoding unstructured accident texts into quantifiable knowledge graph topological features is a highly effective paradigm for accident severity assessment, one that captures the complex physical and causal networks underlying marine accidents. While mitigating the mapping bias in extreme long-tail samples, where structural complexity and semantic severity mismatch, would require integrating deep semantic embeddings, the current pure-structure framework has already demonstrated clear superiority over the traditional baseline models. It serves as a robust, scalable, and accurate tool for practical maritime traffic safety management.