1. Introduction
Ship behavior prediction serves as essential technical support for forecasting and early warning of maritime accidents, significantly improving the efficiency of maritime traffic supervision and reducing traffic risks. However, ship motion is influenced by numerous factors, and the challenges of information transmission at sea lead to noise in AIS data, making ship trajectory prediction more difficult compared to other fields [
1]. With increasing maritime traffic density, the navigational environment has become more complex, demanding higher standards for maritime safety oversight [
2]. In this context, the detection of anomalous ship behavior has become a vital tool for ensuring safety and combating illicit activities at sea [
3]. By leveraging algorithms to effectively identify deviations in ships’ navigation and behavioral patterns, this approach enhances overall waterway safety and strengthens the capacity to monitor and issue early warnings against potential violations. It provides an essential scientific and technical foundation for intelligent maritime governance.
An anomaly refers to an unusual occurrence or a deviation from the norm. Normal ship behavior typically involves movements or patterns consistent with standard maritime operations, such as steady navigation, docking, and routine port activities. In contrast, anomalous behavior is characterized by significant deviations from these expectations—for instance, erratic course changes, the disappearance of a vessel’s track, or unusual navigation routes. Such behaviors often indicate potential illegal activities or breaches of maritime safety regulations. Traditional methods for detecting ship anomalies primarily relied on manual monitoring by land-based personnel. However, with continuously growing vessel traffic and increasingly complex environmental factors, manual monitoring faces numerous limitations and is struggling to meet the demands of modern, intelligent maritime management [
4].
This paper focuses on multimodal data required for ship behavior perception. These datasets, oriented toward the same task or object, are derived from diverse perspectives or domains, with each modality offering distinct advantages and inherent limitations. Furthermore, underlying correlations exist among the multimodal data. By implementing complementary fusion of multimodal data and exploiting the intrinsic interrelationships across modalities, this approach enhances the robustness and accuracy of the integrated system.
- (1)
Multimodal Data Fusion
Current ship behavior anomaly detection primarily relies on AIS (Automatic Identification System), video surveillance, radar, and other modalities [
5]. The AIS broadcasts a ship’s static and dynamic information via a time-division self-organized dynamic network to nearby shore-based stations or other AIS-equipped vessels. While widely adopted, AIS inherently depends on the installation and proper operation of onboard terminals, rendering it a “passive” monitoring mechanism [
6]; video surveillance offers intuitive, reliable, and information-rich monitoring at relatively low cost, particularly in inland waterways. However, as unstructured data, video feeds only indicate visual ship presence without granular ship-specific metadata. Additionally, video quality is highly susceptible to environmental factors such as illumination variations [
7]; maritime radar detects and tracks ships by analyzing target echoes, with widespread applications in coastal ports. Nevertheless, its utility in inland waterways is limited due to signal obstruction from mountainous terrain and curved shorelines along narrow channels [
8]; LiDAR (Light Detection and Ranging), typically deployed at waterway checkpoints or navigable bridges, enables the precise measurement of ship distances and 3D exterior profiles. However, its operational range is constrained, hindering long-distance detection. Moreover, LiDAR point cloud data often exhibits sparsity and spatial sampling inhomogeneity [
9]. In summary, single-source detection methodologies for ship anomaly behavior suffer from inherent limitations in data completeness and reliability [
10]. A comparative analysis of these sensor modalities is summarized in
Table 1.
- (2)
Ship Anomaly Detection Technologies
Current research in ship anomaly detection primarily focuses on AIS data-driven methodologies, encompassing classification/clustering algorithms, geometric feature extraction, and deep learning innovations [
11]. Classification approaches, such as the K-means clustering framework by Oruc et al. [
12], address class imbalance issues while eliminating reliance on inter-ship distance metrics to enhance safety response times for autonomous ships. Geometric feature-based methods, exemplified by Wijaya et al. [
13], analyze trajectory redundancy and curvature to distinguish normal and anomalous ship tracks.
In recent years, with the rapid development of artificial intelligence technology, neural network-based algorithms have gained increasing attention in ship anomaly behavior detection [
14,
15]. Deep learning techniques further advance the field: Rong et al. [
16] employ sliding window algorithms for anomaly detection, while Seong et al. [
17] integrate YOLOv7 and StrongSORT algorithms within a video-based graph framework to identify deviations from normal routes, particularly in narrow coastal zones. Zhang et al. [
18] explored semi-supervised learning with Graph Convolutional Networks (GCNs). Liu et al. [
19] introduced a spatiotemporal multi-GCN for ship trajectory prediction, while Zhao et al. [
20] developed a framework combining k-GCNs and LSTM for ship speed prediction. More recently, Wang et al. [
21] proposed a deep attention-aware spatiotemporal GCN for predicting ship trajectories. A key limitation of conventional GCNs is their use of fixed weights, which forces them to treat all neighboring nodes equally. This approach hinders the model’s ability to distinguish important nodes and to effectively leverage the diverse features present in AIS data.
Despite recent progress in ship anomaly detection, two key gaps remain for inland-waterway early-warning applications. First, many existing approaches still rely on a single sensing source (typically AIS or video) [
22], which limits robustness under real-world complexities such as AIS dropouts/noise, dynamic occlusion, and illumination changes. Second, even when graph-based methods are introduced, conventional GCNs often apply static neighborhood aggregation and do not explicitly model time-varying cross-modal semantic relations or support efficient online updates. These limitations hinder real-time and reliable anomaly detection in complex and evolving inland-waterway environments. The integration of multimodal data—AIS, visual feeds, and LiDAR point clouds—offers enriched feature representation, mitigates single-source limitations, and enhances detection robustness, positioning multimodal fusion as a critical frontier for improving anomaly detection efficiency and reliability [
23]. To bridge these gaps, we propose MF-GCN, which constructs an incremental multimodal graph aligned in time and space and learns dynamic correlations across AIS, video, LiDAR, and water level signals for accurate detection and early warning.
The main contributions of this work are summarized as follows.
(1) Incremental multimodal graph formulation: an incremental graph construction strategy aligning heterogeneous sensor streams with water level signals to enable online anomaly detection.
(2) Two-branch fusion architecture: S-GCN for semantic clustering-based historical context injection and A-GCN for attention-driven cross-modal relation learning.
(3) Real-world dataset and comprehensive evaluation: based on a newly constructed real-world dataset, a comprehensive evaluation was conducted, including systematic experiments and ablation analyses on three warning tasks: ship deviation warning, bridge-crossing warning, and inter-ship collision warning. The results demonstrate that the proposed method consistently and significantly outperforms representative baseline models.
2. Methods
2.1. General Workflow
This study processed ship perception data (AIS, LiDAR point clouds, video imagery) along with waterway elevation data into corresponding feature representations [
24]. LiDAR-derived features encapsulate the 3D spatial information of ships and their surrounding environments, while AIS features include real-time positional coordinates, velocity, and heading. Video data provides ship contour dimensions and relative positioning, and waterway elevation data is analyzed to extract water level trends and fluctuations. Initial preprocessing of video, LiDAR, and AIS data extracted critical ship attributes such as position, speed, and geometric profiles. Subsequently, LiDAR, video, and AIS features were temporally and spatially aligned with water level data using synchronized timestamps and geospatial coordinates. This alignment ensured spatiotemporal consistency across different modalities. As a result, it generated fused multimodal feature datasets [
25]. These integrated datasets unified 3D spatial context, kinematic trajectories, real-time dynamics, and environmental hydrology. Finally, the fused multimodal features were input into MF-GCN, as illustrated in
Figure 1,
Figure 2,
Figure 3,
Figure 4 and
Figure 5. MF-GCN leveraged these comprehensive features to detect anomalies such as ship deviations, collisions, and bridge strikes, thereby enhancing navigational safety and operational reliability.
2.2. MF-GCN Architecture
The Incremental Graph Convolutional Network (IGCN) can be represented as a sequence of graphs G_1, G_2, …, G_t, …, where G_t = (V_t, E_t) denotes the graph at time t. Nodes in the graph represent multimodal feature data, with V_t being the node set and E_t being the edge set. Here, V_t denotes the set of nodes at time t, where each node v_i ∈ V_t corresponds to the feature vector of sensor i [26]. In G_t, an undirected edge e_ij ∈ E_t represents the relationship between sensors i and j. Edges are defined in two ways:
Temporally Ordered Edges: When sensors i and j exhibit explicit temporal dependencies (e.g., sensor i's data precedes sensor j's data), the edge weight is set to 1. To avoid reliance on hardware-level precise synchronization, we adopt a soft synchronization strategy based on GPS timestamps to achieve controllable alignment. All sensors are calibrated to a unified UTC time reference. Using the sensor with the lowest sampling rate (typically AIS) as the reference timeline, we perform nearest-neighbor interpolation for higher-rate modalities at each AIS timestamp. Specifically, at each reference time t, the frame or point cloud whose timestamp is closest to t is selected for alignment. Additionally, a temporal deviation threshold is applied for quality control. Given the relatively slow movement of ships, we ensure that alignment errors remain consistently within 100 ms, which satisfies the precision requirements for behavioral analysis.
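The soft synchronization strategy above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `align_to_reference` and the `(timestamp, payload)` stream representation are assumptions, while the nearest-neighbor selection and the 100 ms rejection threshold follow the description in the text.

```python
import bisect

def align_to_reference(ref_times, stream, max_dev=0.1):
    """Nearest-neighbor alignment of a high-rate stream (video frames,
    point clouds) onto a reference timeline (e.g., AIS timestamps).
    `stream` is a list of (timestamp, payload) pairs sorted by timestamp;
    samples deviating more than `max_dev` seconds from the reference time
    are rejected for quality control."""
    times = [t for t, _ in stream]
    aligned = []
    for t in ref_times:
        i = bisect.bisect_left(times, t)
        # candidates: the neighbors on either side of the insertion point
        cands = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(cands, key=lambda j: abs(times[j] - t))
        if abs(times[best] - t) <= max_dev:
            aligned.append((t, stream[best][1]))
        else:
            aligned.append((t, None))  # nothing within the 100 ms threshold
    return aligned

# AIS timestamps at ~1 Hz; video frames at ~30 fps
ais_times = [0.0, 1.0, 2.0]
video = [(k / 30.0, f"frame{k}") for k in range(70)]
print(align_to_reference(ais_times, video))
```

Because the reference stream is the slowest modality, each reference timestamp receives at most one sample per faster modality, which keeps the aligned multimodal record one-to-one.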
Semantic Edges: For nodes v_i and v_j without temporal order, edge weights are computed using cosine similarity between node features. To mitigate noise from weakly correlated edges, edges with similarity below a threshold of 0.75 are pruned [27].
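The semantic-edge construction can be sketched in a few lines of NumPy; this is an illustrative sketch (the function name `semantic_edges` is not from the paper), but the cosine-similarity weighting and the 0.75 pruning threshold match the description above.

```python
import numpy as np

def semantic_edges(features, threshold=0.75):
    """Build semantic edge weights from pairwise cosine similarity of
    node feature vectors; edges below the pruning threshold are dropped
    (weight set to 0), as are self-loops."""
    X = np.asarray(features, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)           # no self-loops
    return np.where(sim >= threshold, sim, 0.0)

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = semantic_edges(feats)
```

In this toy example the first two feature vectors are nearly parallel, so their edge survives pruning, while the third node stays disconnected at the semantic level.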
This framework dynamically constructs graph structures at any timestep based on node features and edge weights [
28].
The IGCN captures temporal evolution and cross-modal correlations, enabling more effective modeling of dynamic multimodal data compared to traditional architectures [
29,
30]. By treating anomaly detection as a node classification task, IGCN identifies events through integrated spatiotemporal patterns [
31].
LiDAR detects ship targets by emitting laser beams and generates three-dimensional point cloud data by measuring multi-line distances between the laser probes and the ship targets. In this paper, LiDAR point cloud data processing is accomplished through four components: Constraint of Action Range, Constraint of Echo Intensity, Constraint of Geometric Shape, and Feature Extraction of Ship Targets.
By utilizing the Optimization of Detection Network Architecture, Fusion of Scale Diversity, Design of Prior Anchor Boxes, and Attention Feature Extraction Network from our prior research on ship object detection in low-illumination environments [
7], the model can more accurately extract ship features from video image data.
AIS data is structured data. However, due to communication or hardware failures, raw AIS data may contain aberrant records and usually requires cleaning before it can be properly used. AIS data cleaning primarily addresses duplicated identification data, position drift, missing trajectory data, and abnormal steering records.
In article [
32], we incorporated a multi-head attention mechanism into the GRU model, enabling it to assign feature weights to critical factors such as temporal and spatial elements in waterway water level sequence data. This approach allows the model to focus on the key factors influencing changes in waterway water levels.
A multimodal data fusion method is proposed, which integrates LiDAR point cloud, video image, and AIS data to extract key ship features such as position, speed, and shape, and then calibrates and correlates them with corresponding spatiotemporal waterway water level information. This method treats water level data as a dynamic environmental background, combining real-time ship status with environmental conditions through spatiotemporal matching to analyze the impact of water level changes on ship navigation. This fusion architecture enables more accurate monitoring of ship status, assists in adjusting navigation strategies in response to water level variations, and ultimately enhances the reliability of ship monitoring and navigation safety. The specific fusion architecture is shown in
Figure 5.
Key Challenges and Solutions:
(1) Temporal Dependency Propagation: how to discover latent inter-node correlations at time t and propagate them to t + 1.
(2) Incremental Feature Learning: how to leverage prior temporal data to guide feature extraction for new nodes in evolving graphs.
To address these challenges, the proposed MF-GCN employs two specialized GCNs: Semantic Clustering-based GCN (S-GCN) identifies latent correlations through semantic similarity clustering, and Attentive Fusion-based GCN (A-GCN) dynamically fuses features using attention mechanisms. As illustrated in
Figure 6, S-GCN and A-GCN operate synergistically to enhance detection robustness. Subsequent sections detail their designs.
2.3. Branch Network I: Semantic Clustering-Based GCN (S-GCN)
This subsection introduces the workflow of S-GCN, which focuses on learning newly acquired ship features by leveraging semantic correlations between ship attributes and water level data. S-GCN first clusters historical ship features based on sensor-specific semantics, reconstructs an enhanced graph, and then performs graph learning to derive refined node features that embed sensor-level contextual information [
33]. The workflow is illustrated in
Figure 7.
Consider a system with K sensors. For the IGCN, the graph G_t = (V_t, E_t) represents the monitoring state at time t. The vertex set V_{t−1} from time t − 1 is aggregated into K clusters, one per sensor, via mean pooling operations. The fused feature for the k-th sensor is denoted as c_k. Edges between fused vertices c_k and new vertices v ∈ V_t^new are recomputed based on cosine similarity between their feature vectors. The updated graph G_t′ = (V_t′, E_t′) is derived as follows:

V_t′ = V_t^new ∪ C_{t−1}, C_{t−1} = {c_1, c_2, …, c_K}, (1)

where V_t^new denotes the set of newly added nodes at time t, and V_{t−1} represents the node set from time t − 1. The fused cluster set C_{t−1}, derived from the K sensors via mean pooling, combines with V_t^new to form the updated node set V_t′.

The edge set E_t′ is computed as

E_t′ = E_new ∪ E_cross ∪ E_cluster ∪ E_link, (2)

where
E_new: intra-edges among nodes in V_t^new;
E_cross: inter-edges connecting V_t^new to existing nodes;
E_cluster: edges between fused sensor clusters;
E_link: edges between fused clusters and new nodes.
S-GCN operates on G_t′ to refine node features by integrating ship attributes and water level data. The graph convolution is formulated as

H^(l+1) = σ( Â_k H^(l) W^(l) + b^(l) ),

where
Â_k: normalized adjacency matrix for the k-th sensor cluster;
W^(l), b^(l): trainable weights and biases at layer l;
σ(·): non-linear activation function.

This process enriches new nodes with semantic context from historical clusters, enhancing detection accuracy by mitigating feature sparsity and noise [34].
2.4. Branch Network II: Attentive Fusion-Based GCN (A-GCN)
As perception data is continuously collected, the number of nodes in the IGCN increases dynamically over time. At any timestep t, this section employs an Attentive Fusion GCN (A-GCN) with multi-head attention mechanisms to explore latent correlations between multimodal data [35]. The workflow is illustrated in
Figure 8, and the methodology involves three key steps:
The multi-head attention mechanism captures diverse latent relationships between nodes by computing parallel attention heads, generating multiple feature representations for each node. These representations comprehensively reflect complex interdependencies among multimodal data [36]. During graph learning, the GCN updates node features to better encapsulate the system's global state at time t. The fused features provide robust support for subsequent event detection and early warning.
While edge weights in the original graph G_t are computed via cosine similarity, such static measures fail to capture nuanced semantic relationships between sensor data. To address this, multi-head attention augments the original graph by generating n fully connected subgraphs [37]. For each attention head h ∈ {1, …, n}, paired transformation matrices W_h^Q and W_h^K are initialized.
The standard multi-head attention is defined as

Attention(Q, K, V) = softmax( QK^T / √d_k ) V,

where Q, K, and V represent the query, key, and value matrices, respectively, and √d_k is the scaling factor. Multiple heads (n) enable diverse relational modeling, mitigating biases from single-attention perspectives [38].
To adapt this mechanism for graph augmentation, we substitute V with the adjacency matrix A, yielding

A_h = softmax( (X W_h^Q)(X W_h^K)^T / √d_k ) A.

Here, X denotes the node feature matrix, and A is the original adjacency matrix of G_t. Each head h generates a fully connected subgraph G_t^h, encoding distinct semantic relationships. Replacing V with A preserves the graph's structural priors while injecting attention-driven semantics.
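The graph-augmentation step can be sketched as follows; a minimal NumPy sketch, assuming random per-head projection matrices (in practice W_h^Q and W_h^K are trained) and the V-to-A substitution described above. The function name `attention_subgraphs` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_rows(Z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def attention_subgraphs(X, A, n_heads=4, d_k=8):
    """Per head: score node pairs with query/key projections of the node
    features X, then right-multiply by the original adjacency A (the
    substitution of V by A), yielding n augmented adjacency matrices."""
    N, d = X.shape
    subgraphs = []
    for _ in range(n_heads):
        Wq = rng.normal(size=(d, d_k))  # per-head transformation matrices
        Wk = rng.normal(size=(d, d_k))
        scores = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d_k))
        subgraphs.append(scores @ A)    # inject the structural prior A
    return subgraphs
```

Each returned matrix plays the role of one attention-augmented subgraph A_s in the subsequent graph-convolution layers.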
The n fully connected subgraphs generated via multi-head attention undergo iterative feature refinement through graph convolutional layers [39]. Let i denote the index of a vertex in the graph. To enhance vertex embedding, the update of vertex i at layer l incorporates both its initial features and aggregated historical embeddings from preceding layers, formulated as

h̃_i^(l) = [ x_i ∥ h_i^(1) ∥ … ∥ h_i^(l−1) ], (8)

where
h̃_i^(l): concatenated feature vector of vertex i at layer l, combining its raw input feature x_i with embeddings from layers 1 to l − 1;
h_i^(l): embedding of vertex i at layer l.
The residual feature concatenation (Equation (8)) preserves historical context to mitigate gradient vanishing and enhance feature stability across deep layers.
For each vertex i, the feature update at layer l integrates information from all n subgraphs:

h_i^(l),s = σ( A_s h̃^(l) W_s^(l) + b_s^(l) )_i,

where
s: index of the subgraph;
A_s: attention-augmented adjacency matrix of the s-th subgraph;
W_s^(l), b_s^(l): trainable weight matrix and bias term for the s-th subgraph at layer l;
σ(·): ReLU activation function.
Following graph embedding, each node obtains n distinct embeddings from the n subgraphs. To unify these representations, mean pooling is employed to fuse the embeddings, producing a consolidated feature vector for each node:

z_i = (1/n) Σ_{s=1}^{n} h_i^(L),s.

This fusion step finalizes the multimodal association and integration within the GCN framework.
The final detection stage concatenates features from both branch networks: S-GCN (capturing correlations between water level and ship attributes) and A-GCN (encoding cross-modal attention dynamics). The concatenated features are processed through a fully connected layer followed by a softmax classifier for anomaly prediction.
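The detection head can be sketched as below; a minimal NumPy illustration in which the function name `detection_head` and the feature dimensions are assumptions, while the concatenate-then-FC-then-softmax structure follows the text.

```python
import numpy as np

def detection_head(h_s, h_a, W_fc, b_fc):
    """Concatenate S-GCN and A-GCN features per node, apply a fully
    connected layer, and return softmax class probabilities."""
    H = np.concatenate([h_s, h_a], axis=1)   # [N, d_s + d_a]
    logits = H @ W_fc + b_fc                 # [N, num_classes]
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)  # row-wise softmax
```

Each row of the output is a probability distribution over event labels (e.g., normal vs. a specific anomaly class) for one node, so prediction reduces to an argmax per node.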
2.5. Training and Testing Procedures
The objective of the training process is to learn the parameters θ_1 of the S-GCN model and θ_2 of the A-GCN model. The pseudocode is summarized in Algorithm 1, where
f_S(·; θ_1): S-GCN function leveraging ship attributes and water level data;
f_A(·; θ_2): A-GCN function for cross-modal feature integration.
| Algorithm 1: Training process |
| Input: Graph sequence: G_1, G_2, …, G_T; initial parameters θ_1 & θ_2 |
| Output: The learned GCN parameters θ_1 & θ_2 |
| 1: for t = 1, 2, …, T do |
| 2: if t = T then |
| 3: H_S ← f_S(G_t′; θ_1); |
| 4: H_A ← f_A(G_t′; θ_2); |
| 5: H ← [H_S ∥ H_A]; |
| 6: Update θ_1 & θ_2 by minimizing the classification loss on H, Break; |
| 7: else |
| 8: Collect the newly added node set V_t^new; |
| 9: Compute the updated node set V_t′ by Equation (1); |
| 10: Compute the updated edge set E_t′ by Equation (2); |
| 11: Construct the updated graph G_t′ = (V_t′, E_t′); |
| 12: H_S ← f_S(G_t′; θ_1); H_A ← f_A(G_t′; θ_2); |
| 13: H ← [H_S ∥ H_A]; |
| 14: Update θ_1 & θ_2 by minimizing the classification loss on H; |
| 15: return The final parameters: θ_1 & θ_2; |
During inference, the framework processes incremental sensor data at each timestep to perform real-time event detection. The pseudocode for this phase is outlined in Algorithm 2:
| Algorithm 2: Testing process |
| Input: Graph sequence: G_1, G_2, …, G_T; the trained parameters θ_1 & θ_2 |
| Output: The label prediction Y for each vertex |
| 1: for t = 1, 2, …, T do |
| 2: if t = T then |
| 3: H_S ← f_S(G_t′; θ_1); |
| 4: H_A ← f_A(G_t′; θ_2); |
| 5: H ← [H_S ∥ H_A]; |
| 6: Y ← softmax(FC(H)), Break; |
| 7: else |
| 8: Collect the newly added node set V_t^new; |
| 9: Compute the updated node set V_t′ by Equation (1); |
| 10: Compute the updated edge set E_t′ by Equation (2); |
| 11: Construct the updated graph G_t′ = (V_t′, E_t′); |
| 12: H_S ← f_S(G_t′; θ_1); |
| 13: H_A ← f_A(G_t′; θ_2); |
| 14: H ← [H_S ∥ H_A]; |
| 15: Y ← softmax(FC(H)); |
| 16: Output the predicted labels Y for the new nodes; |
| 17: return The label predictions Y; |
During each timestep, the framework focuses exclusively on the newly added node set V_t^new to predict event labels for incremental perception data. Here, f denotes the classic graph convolution operation using the pre-trained parameters θ_1 and θ_2; FC denotes the fully connected layer mapping fused features to event labels; and Y denotes the predicted labels for new sensor data.
3. Experiments and Results
3.1. Dataset Construction
Common anomalous ship behaviors in inland waterways include navigation deviations, collisions, overloading, and obscured ship identification [
40]. Due to the absence of public datasets for inland ship anomaly detection, this study collected multi-source data from the Jiangsu section of the Yangtze River, Hujiashen Line, and Hangshen Line (Jiaxing section) to construct a dedicated Ship Anomaly Event Detection Dataset. The dataset supports three critical tasks: ship deviation warning, ship bridge-crossing warning, and inter-ship collision warning. Data acquisition covered diverse environmental conditions (daytime, nighttime, and dusk) [
41]. A total of 500 multimodal event segments were constructed, including 47 abnormal segments and 453 normal segments. Each segment spans approximately 5 min and contains aligned AIS records, video frames, LiDAR point clouds, and water level measurements with one-to-one correspondence to the same ship event. Although abnormal events are inherently rare in real inland waterways, the proposed MF-GCN does not treat each 5 min segment as a single independent sample. Instead, the incremental graph formulation discretizes each segment into a sequence of time increments and performs event learning and prediction in a node-/timestep-wise manner. Therefore, each segment contributes multiple supervised graph updates (approximately 300/T updates for a 5 min segment under time increment interval T), which increases the effective number of learning instances beyond the raw segment count while preserving event-level labeling consistency. The sensor specifications used are as follows:
Video (HIKVISION, Hangzhou, China): 1920 × 1080 resolution @ 30 fps [
42];
LiDAR (RoboSense, ShenZhen, China): 192-line scanning, 300 m operational range, ±3 cm accuracy at maximum distance [
43].
AIS (Dalian Ninghang Communication & Navigation Co., Ltd., Dalian, China): Dual-channel (161.975/162.025 MHz) reception compliant with international standards, providing dynamic (position, speed) and static (ship ID, dimensions) data [
44].
- (1)
Ship Deviation Warning Detection
Ship deviation warning detection primarily targets anomaly behaviors such as deviation from the planned waterway or deviation from the buoy area during ship navigation. Part of the dataset is shown in
Figure 9. For ship deviation warning detection, dataset collection was conducted at Hongxitang East Bridge and Xiadianmiao Bridge on the Hangshen Line Waterway, as well as the waterway upstream of Nanjing Yangtze River Bridge, with the data collection period ranging from 19 February to 2 July 2024.
- (2)
Ship Bridge-Crossing Warning Detection
For ships passing through bridges, full-course monitoring is required to prevent anomaly behaviors such as deviating from the planned route and colliding with bridge piers or decks. Part of the dataset is shown in
Figure 10. For ship bridge-crossing warning detection, the dataset was collected at Beiyue Bridge on the Hangshen Line Waterway, Dongcai Bridge on the Hujiashen Line Waterway, and Yuhui Bridge on the Yechi Waterway, spanning from 6 March to 2 July 2024.
- (3)
Inter-Ship Collision Warning Detection
Inter-ship collision warning primarily targets anomalous behaviors such as excessively small inter-ship distances or violations of the closest point of approach (CPA) between multiple ships during navigation, which can easily lead to collisions. Part of the dataset is shown in
Figure 11. For inter-ship collision warning detection, field data acquisition was carried out in the waters near Nanjing Baguazhou Bridge over the Yangtze River, the navigation sections both upstream and downstream of Nanjing Yangtze River Bridge, and the waters adjacent to the Longtan Navigation Mark, with the data collection window from 8 May to 2 July 2024.
3.2. Event Detection Performance Evaluation
The MF-GCN algorithm was implemented using the PyTorch 1.9 deep learning framework. The sliding window size was set to five to obtain optimal experimental results. During training, the Adam optimizer was employed, with a weight decay of 0.0001 and a learning rate of 0.0001. Network parameters were randomly initialized at the beginning of training. An ε-greedy strategy was adopted during training, with a total of 30 epochs conducted. In the first 10 epochs, ε was fixed at 1.0 to allow the model to gradually learn its parameters. Thereafter, ε was set to 0.1, enabling the model to adjust its parameters based on learning experiences from its own decisions [
45]. Parameter updates were performed using stochastic gradient descent and backpropagation algorithms, with Dropout regularization applied during training. All experiments were conducted on a server configured with an NVIDIA 3090 GPU (Santa Clara, CA, USA) and an Intel i9 CPU (Santa Clara, CA, USA) [
46].
To ensure fair evaluation for imbalanced anomaly detection, we report Accuracy, Precision, Recall, and F1-score in our comparisons (
Table 2,
Table 3 and
Table 4) [
47]. In particular, Precision/Recall and F1-score are more informative than Accuracy when abnormal events are rare. In addition, since each event segment is discretized into multiple timesteps under the incremental graph formulation, data splitting is performed at the event-segment level to avoid temporal leakage (i.e., all timesteps derived from the same 5 min segment are assigned to the same subset), and the abnormal/normal ratio is kept consistent across subsets to the extent possible. The proposed method is evaluated on three warning tasks (ship deviation warning, ship bridge-crossing warning, and inter-ship collision warning) and compared with representative state-of-the-art methods; the experimental results are summarized in
Table 2,
Table 3 and
Table 4.
MF-GCN is designed for real-time monitoring via incremental inference. At each timestep, the framework focuses on the newly added node set and its local connections, rather than recomputing representations over the entire historical graph. For a typical GCN layer with feature dimension d, the message-passing cost is proportional to the number of edges involved in the current update (approximately O(|E_sub| · d)), where E_sub denotes the subgraph edges participating in the incremental update. Because a fixed sliding window is adopted in our implementation, the size of the subgraph participating in each update is bounded, and the per-step computation remains stable over long monitoring periods. This incremental-update property contrasts with full-graph recomputation, whose cost grows with the accumulated graph size, and therefore supports scalable deployment in continuous inland-waterway surveillance settings.
First, the proposed model was evaluated on the ship deviation warning detection dataset, with results shown in
Table 2. In the ship deviation warning detection task, the proposed method demonstrated significant advantages in all key metrics. Specifically, the method achieved an accuracy of 93.8%, a precision of 93.5%, a recall of 93.7%, and an F1 score of 93.6%, comprehensively outperforming other advanced models. Our model not only led by more than 2 percentage points in accuracy but also exceeded others by 2.2% and 2.3% in precision and recall, respectively, demonstrating its comprehensive advantages in reducing false alarms and capturing true deviation events and verifying its effectiveness and superiority.
Next, the results on the ship bridge-crossing warning detection dataset are shown in
Table 3. In the ship bridge-crossing warning detection task, the proposed method still demonstrated excellent performance in accuracy and F1-score. The method achieved an accuracy of 93.8% and an F1-score of 93.6%, significantly outperforming other models.
Finally, the results on the inter-ship collision warning detection dataset are shown in
Table 4. In the inter-ship collision warning detection task, the proposed method also demonstrated significant advantages in all evaluation metrics, achieving an accuracy of 93.3%, a precision of 93.8%, a recall of 92.8%, and an F1-score of 93.3%, representing significant improvements over other models. The performance improvement in the proposed method is attributed to the effective modeling of multimodal information correlations, enabling the model to more accurately detect events in complex multi-sensor data environments and verifying the effectiveness and superiority of the proposed method in ship warning detection tasks.
3.3. Ablation Experiments
To better explore the effectiveness of each component in the MF-GCN algorithm, an ablation study was conducted in this section. The experimental results for ship deviation warning detection are shown in
Table 5, where “−” indicates the removal of the corresponding component and “+” indicates its inclusion in the event detection and warning task.
The following conclusions can be drawn from the experimental results:
(1) A-GCN Outperforms S-GCN: A-GCN achieved a higher F1 score than S-GCN because each sensor's data provides richer semantic information about new data, whereas S-GCN primarily focuses on semantic-level correlations between sensor data. This study used only a mean-based method to fuse information from multiple sensors, leading to some information loss, which will be addressed in future work.
(2) Combined Model Achieves Optimal Results: The combined model demonstrated the best performance, indicating that the attention-based fusion GCN can compensate for the limitations of the semantic clustering GCN. Additionally, the multimodal information GCN provides a novel perspective for generating new data embeddings and captures information overlooked by the semantic clustering GCN.
3.4. Impact of Time Increment Interval on Model Performance
The effect of the time increment interval T on model performance was investigated in this section. The time interval T determines the frequency at which new perceptual data are added to the new graph G_t′, i.e., the frequency of adding new nodes to the graph structure. The optimal time interval parameter was selected through comparative experiments. In the experiments, T was set to 5 s, 10 s, 20 s, and 30 s in sequence, with results shown in
Table 6.
In Table 6, the column “S-GCN” represents the accuracy of ship deviation warning detection using only the S-GCN, the column “A-GCN” represents the accuracy using only the A-GCN, and “Accuracy” and “F1” denote the results of using both GCN layers simultaneously. The experimental results show that, within a certain range, event detection performance increases as the time interval increases. When T = 5 s, the overall model achieved an accuracy of 92.7%. Model performance improved as T increased, reaching its optimum at T = 10 s with an accuracy of 93.8%. However, when T > 10 s, performance declined because new sensor data was added to the graph too infrequently to capture sufficient information, affecting graph learning.
The performance of S-GCN improved as nodes were added to the graph more frequently, but only up to a point: beyond it, the additional nodes introduced more invalid potential correlations. Because these new nodes lack semantic information, they cannot exert an effective influence and thus degrade the graph embedding. Excessive perceptual data therefore provides no additional useful information for event detection.
A-GCN also demonstrated good performance, achieving 92.5% accuracy at T = 10 s, as the new graph considered the correlations between multimodal information. However, when T > 10 s, model performance did not significantly improve because the new data update frequency was too low to provide sufficient dynamic information. At T = 5 s, more new nodes increased the impact of new sensor data, which introduced uncertainty and affected final performance—consistent with the behavior of S-GCN. Therefore, it is necessary to concatenate the feature vectors of S-GCN and A-GCN to leverage their combined advantages.
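The incremental node addition governed by the interval T can be sketched as follows. The cosine rule for connecting new nodes, the similarity threshold, and all feature values are illustrative assumptions, not the paper's exact graph construction:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def grow_graph(adj, feats, new_feats, threshold=0.5):
    """Append nodes for perceptual data arriving in the latest interval.
    Existing structure is kept untouched; edges are added only between a
    new node and existing nodes whose similarity exceeds the threshold."""
    n_old = feats.shape[0]
    all_feats = np.vstack([feats, new_feats])
    n = all_feats.shape[0]
    grown = np.zeros((n, n))
    grown[:n_old, :n_old] = adj            # old edges carried over as-is
    for i in range(n_old, n):              # connect only the new nodes
        for j in range(n_old):
            if cosine_sim(all_feats[i], all_feats[j]) > threshold:
                grown[i, j] = grown[j, i] = 1.0
    return grown, all_feats

# Two existing nodes, then one batch of new data after an interval T.
adj0 = np.array([[0.0, 1.0], [1.0, 0.0]])
feats0 = np.array([[1.0, 0.0], [0.0, 1.0]])
adj1, feats1 = grow_graph(adj0, feats0, np.array([[0.9, 0.1]]))
```

A larger T means fewer, larger batches of new nodes per update; a smaller T means frequent updates that may flood the graph with weakly informative nodes, matching the trade-off observed above.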
3.5. Impact of Multimodal Information Volume on A-GCN Performance
The influence of the number of graphs in the multi-head attention mechanism of A-GCN, i.e., the volume of multimodal information, was further examined in this section [55]. This parameter affects the update of node features during graph convolution; the number of graphs is denoted as n. To determine the optimal value, experiments were conducted in which n was varied from 2 to 10, and the results of the ship deviation warning detection experiments are shown in Figure 12.
As shown in Figure 12, as n increased from 2 to 10, the performance of A-GCN generally exhibited an initial increase followed by a decline. When n = 2, the model achieved an accuracy of 90.5%. Performance gradually improved with increasing n, peaking at n = 8 with an accuracy of 93.8%, and began to decline when n > 8. This phenomenon can be attributed to two main factors:
(1) Increased Potential Relationships: More fully connected graphs provide additional potential correlations between sensor data for feature learning, leading to improved experimental results as n increases.
(2) Introduction of Noise: As the number of graphs increases, excessive potential relationships introduce noise, negatively impacting feature updates and causing performance degradation at higher n.
In summary, the multi-head attention mechanism effectively enhances latent correlations in embeddings, but performance does not strictly improve with increasing graph count.
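A minimal sketch of the n fully connected attention graphs discussed above, assuming simple dot-product attention with random projection weights; the head design and dimensions are hypothetical and do not reproduce the paper's exact A-GCN layer:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_graph_attention(feats, n_heads=8, d_head=8, seed=0):
    """Each head projects the node features and builds its own fully
    connected attention matrix (one 'graph' per head); the n head
    outputs are averaged into a single node embedding."""
    rng = np.random.default_rng(seed)
    d = feats.shape[1]
    heads = []
    for _ in range(n_heads):
        W = rng.standard_normal((d, d_head)) / np.sqrt(d)
        h = feats @ W                              # per-head projection
        attn = softmax(h @ h.T / np.sqrt(d_head))  # one attention graph
        heads.append(attn @ h)                     # aggregate over neighbors
    return np.mean(heads, axis=0)

out = multi_head_graph_attention(np.eye(5), n_heads=8, d_head=8)
```

Each additional head contributes one more set of potential cross-modal correlations, which is why performance first rises with n and then degrades once the extra graphs mostly add noise.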
3.6. Impact of Fusion Methods on S-GCN Performance
The influence of different fusion methods in S-GCN on overall performance was examined in this section. The mean pooling method was employed to generate sensor features in the new graph. In addition, other fusion methods, namely max pooling and min pooling, were tested on ship deviation warning detection to evaluate their effects [56]. The experimental results are shown in Table 7.
During the experiments, the mean pooling method was first used to fuse data from the different sensors, and the results showed that it performed best in terms of accuracy and F1 score, reaching 93.8% and 93.6%, respectively. This is because mean pooling integrates the data from each sensor more evenly and smooths outliers, providing a more stable feature representation. The max pooling method was then tested, yielding an accuracy of 92.9% and an F1 score of 92.8%. Max pooling selects the maximum value from each sensor’s data, which can capture salient features in some cases but may also discard detailed information, resulting in slightly inferior performance compared to mean pooling. Finally, the min pooling method yielded an accuracy of 91.8% and an F1 score of 91.2%. Min pooling selects the minimum value from each sensor’s data, which offers some advantages in handling noise and outliers but has poor overall feature representativeness; hence, its performance was inferior to the first two methods.
Based on these experimental data, the mean pooling method was found to be the most effective among different fusion methods, as it balances the influence of data from each sensor and provides a more reliable and stable feature representation. Therefore, the mean pooling was ultimately selected as the primary fusion method to improve the accuracy and reliability of ship event warning detection.
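The three fusion rules can be contrasted on toy data. The sensor values below are hypothetical and serve only to show how the pooling operators differ element-wise:

```python
import numpy as np

# Toy feature vectors from three sensors describing the same object
# (values are illustrative, not taken from the paper's dataset).
sensor_feats = np.array([
    [0.2, 0.9, 0.4],   # e.g. AIS-derived features
    [0.4, 0.7, 0.6],   # e.g. video-derived features
    [0.0, 0.8, 0.8],   # e.g. LiDAR-derived features
])

fused_mean = sensor_feats.mean(axis=0)  # adopted: evens out all sensors
fused_max  = sensor_feats.max(axis=0)   # keeps peaks, may drop detail
fused_min  = sensor_feats.min(axis=0)   # suppresses spikes, weak overall
```

Because a single outlying sensor value shifts the mean by only 1/k (for k sensors) but can dominate the max or min entirely, mean pooling yields the most stable fused representation, consistent with the results in Table 7.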
3.7. Impact of Edge Computation Formulas on Model Performance
This subsection investigates how edge computation formulas affect the proposed model. The similarity metric determining edge weights directly influences feature learning and model efficacy. We conducted experiments comparing three common similarity measures: Euclidean Distance (ED), Cosine Distance (CD), and Manhattan Distance (MD). Results are shown in Figure 13.
Experimental results (Figure 13) reveal that the choice of similarity metric, ED, CD, or MD, exerts only marginal influence on the model, with all three achieving >92% accuracy and F1-score (CD: 93.8%/93.6%, MD: 92.2%/92.6%, ED: 92.7%/92.3%). Notably, CD delivers optimal performance due to its directional sensitivity and invariance to feature magnitude, which better captures semantic relationships in high-dimensional multimodal data. Consequently, CD is adopted for edge weight computation in the final framework to maximize detection accuracy and stability while maintaining robustness against metric variations.
3.8. Visualization Analysis
The proposed MF-GCN was employed to visualize ship anomaly alerts. For a selected video segment, the algorithm analyzes each frame to output probabilities of ship deviation, ship bridge-crossing, and inter-ship collision, as demonstrated in Figure 14.
For ship deviation warning, four sequential frames (first row in Figure 14) show increasing deviation probabilities: 55% → 66% → 82% → 92%. This progression indicates escalating deviation risk, culminating in near-collision with a navigation buoy, likely caused by erroneous course adjustments. Such visual analytics enable real-time perception of deviation trends, providing critical alerts for maritime supervision, especially in narrow channels where timely corrections prevent hazardous entries.
For ship bridge-crossing warning, the probabilities of bridge collision across frames are 37% → 67% → 62% → 5% (second row in Figure 14). The fluctuating risk reflects dynamic ship-bridge distance variations influenced by ship speed, heading, and water levels. High-risk periods (e.g., 67%) occur during bridge approach and transit, while post-transit risk dissipates (5%). This capability enhances situational awareness for infrastructure safety management.
For inter-ship collision warning, collision probabilities rise from 35% → 55% → 65% → 69% (third row in Figure 14), signaling decreasing ship separation due to converging paths or operational errors. Visual trajectory overlays clarify spatial relationships and motion trends, enabling proactive collision avoidance.
The visualization framework intuitively quantifies three critical risks—ship deviation, bridge-crossing, and inter-ship collision—through probability visualization. By translating multimodal fusion results into actionable spatial–temporal insights, the system significantly enhances navigational safety decision making for maritime authorities.
4. Discussion
4.1. Interpretation of Results and Why MF-GCN Works
The experimental results across three inland-waterway warning tasks demonstrate that MF-GCN consistently outperforms representative baselines. In particular, MF-GCN achieves 93.8% accuracy for both ship deviation warning and bridge-crossing warning and 93.3% accuracy for inter-ship collision warning, with corresponding F1-scores of 93.6%, 93.6%, and 93.3%. This advantage is largely attributed to the complementary nature of heterogeneous sensing: AIS provides structured kinematic attributes; video offers intuitive visual cues; LiDAR supplies accurate geometric distance and 3D profile information; and water level signals capture environmental dynamics that can affect clearance and risk near bridges. By aligning these sources and learning their correlations within an incremental graph, MF-GCN reduces the failure modes of single-modality systems under occlusion, illumination changes, or AIS noise. The ablation study further validates the necessity of combining semantic context modeling (S-GCN) and attention-driven cross-modal fusion (A-GCN), while parameter studies show stable performance within a reasonable range of settings.
4.2. Limitations and Failure Modes
Although MF-GCN enhances the robustness of the system, several limitations remain. Under extreme adverse weather conditions that simultaneously affect multiple sensors, dense fog can significantly attenuate LiDAR returns, and heavy rain can severely degrade video visibility, potentially leading to a decline in model performance. Although the current dataset is collected from multiple waterways and covers various lighting conditions, it still reflects the practical scarcity of real abnormal event samples in real-world scenarios, meaning that rare edge cases may not be sufficiently represented. Additionally, cross-region generalization may be impacted by multiple forms of domain shifts, including differences in camera viewpoints, background clutter, variations in vessel distribution characteristics, and local hydrodynamic conditions.
4.3. Scalability and Practical Deployment Considerations
A key practical advantage of MF-GCN is its incremental inference mechanism. Instead of rebuilding a full graph when new sensor data arrives, the framework focuses on newly added nodes and their local connections at each timestep, which is more suitable for real-time monitoring. For deployment at scale, engineering factors include reliable multi-sensor timestamping and soft synchronization, communication latency and packet loss (especially for AIS), and edge–cloud collaboration. Model compression and hardware-aware optimization can further reduce computational requirements and facilitate deployment on mid-range GPUs or high-performance embedded devices.
4.4. Future Work: Simulation-Based Data Augmentation and Continual Learning
To address the limited availability of rare abnormal events, a promising direction is simulation-based dataset generation. We plan to construct a digital-twin inland-waterway environment to generate controllable abnormal scenarios including collision, grounding, and bridge-strike events under diverse weather and traffic conditions. Such synthetic data can complement real-world observations and improve coverage of edge cases. In addition, MF-GCN is compatible with continual learning: newly collected labeled events can be incrementally incorporated to update the model over time.
5. Conclusions
The experimental results validate the effectiveness of MF-GCN across three tasks: ship deviation warning, bridge-crossing warning, and inter-ship collision warning. The method significantly outperforms existing models, achieving accuracies of 93.8%, 93.8%, and 93.3%, respectively. Additionally, the study investigates the impact of key parameters on model performance, including the time interval T, the number of graphs in the multi-head attention mechanism n, fusion methods, and edge weight calculation formulas. The results indicate that the algorithm achieves optimal performance when T = 10 s, n = 8, mean pooling is applied for fusion, and cosine distance is selected for calculating edge weights. These findings confirm the accuracy and effectiveness of MF-GCN in complex scenarios, providing new insights for multimodal data fusion and dynamic feature learning. This study not only proposes an innovative multimodal graph fusion framework at the algorithmic level, advancing the development of heterogeneous perceptual information fusion and spatiotemporal modeling techniques, but also holds significant practical importance:
(1) Enhancing vessel navigation safety and abnormal behavior detection capabilities in complex environments: By fusing multi-source data such as AIS, video, LiDAR, and water level information, the system can more reliably identify abnormal behaviors such as deviation and collision risks. It demonstrates stronger robustness particularly under challenging conditions like occlusion and adverse weather, providing effective technical support for real-time vessel safety assistance and autonomous decision making.
(2) Supporting intelligent dynamic maritime supervision and waterway operational safety: The proposed architecture enables real-time alignment and incremental inference of multi-sensor information, contributing to the development of a more extensive and responsive intelligent monitoring system. It offers a scalable technical solution for maritime authorities to conduct comprehensive dynamic supervision, waterway traffic scheduling, and emergency incident management, thereby enhancing the overall safety and efficiency of waterway operations.