Article

A Deep Ship Trajectory Clustering Method Based on Feature Embedded Representation Learning

Naval University of Engineering, Wuhan 430000, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(1), 81; https://doi.org/10.3390/jmse14010081
Submission received: 6 December 2025 / Revised: 26 December 2025 / Accepted: 28 December 2025 / Published: 31 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Trajectory clustering is of great significance for identifying behavioral patterns and vessel types of non-cooperative ships. However, existing trajectory clustering methods suffer from limitations in extracting cross-spatiotemporal scale features and modeling the coupling relationship between positional and motion features, which restricts clustering performance. To address this, this study proposes a deep ship trajectory clustering method based on feature embedded representation learning (ERL-DTC). The method designs a Temporal Attention-based Multi-scale feature Aggregation Network (TA-MAN) to achieve dynamic fusion of trajectory features from micro to macro scales. A Dual-feature Self-attention Fusion Encoder (DualSFE) is employed to decouple and jointly represent the spatiotemporal position and motion features of trajectories. A two-stage optimization strategy of “pre-training and joint training” is adopted, combining contrastive loss and clustering loss to jointly constrain the embedded representation learning, ensuring it preserves trajectory similarity relationships while being adapted to the clustering task. Experiments on a public vessel trajectory dataset show that for a four-class task (K = 4), ERL-DTC improves ACC by approximately 14.1% compared to the current best deep clustering method, with NMI and ARI increasing by about 28.9% and 30.2%, respectively. It achieves the highest Silhouette Coefficient (SC) and the lowest Davies-Bouldin Index (DBI), indicating a tighter and more clearly separated cluster structure. Furthermore, its inference efficiency is two orders of magnitude higher than that of traditional point-matching-based methods, and its model complexity does not significantly increase runtime. Ablation studies and parameter sensitivity analysis further validate the necessity of each module design and the rationality of the hyperparameter settings. This research provides an efficient and robust solution for feature learning and clustering of vessel trajectories across spatiotemporal scales.

1. Introduction

With the exponential growth of global shipping trade, activities of non-cooperative ships (e.g., illegal sand-dredging ships with disabled AIS, smuggling ships with disguised identities, and spy survey ships evading supervision) have become increasingly frequent. These ships often evade monitoring by concealing their identities or intermittently disabling positioning equipment [1,2], posing serious challenges to ecological protection and territorial security. For a long time, relevant authorities have acquired and accumulated vast amounts of trajectories through means such as shore-based radar detection and remote sensing monitoring. These data often contain spatiotemporal behavioral logic and activity patterns highly correlated with ship identity information [3,4]. Mining trajectory data can establish connections between ship navigation patterns and high-value information such as identity, which is of great value for maritime supervision [5], risk assessment [6], and route planning optimization [7].
As one of the core tasks of trajectory mining, trajectory clustering discovers ship behavior patterns and spatiotemporal aggregation characteristics in an unsupervised manner, thereby revealing ship activity patterns. This approach suits the challenges of missing labels and difficult annotation for non-cooperative ship trajectories [8], garnering widespread attention from researchers. The fundamental principle of trajectory clustering is to partition trajectory features into groups or clusters based on similarity metrics, ensuring intra-group homogeneity and inter-group heterogeneity [9]. Traditional trajectory clustering methods primarily rely on distance metrics between trajectory points, such as DTW [8], Hausdorff distance [10], and Fréchet distance [11]. Their main drawbacks include sensitivity to noise and sampling rate [12], as well as high computational complexity. To address these issues, researchers have proposed deep trajectory clustering methods, which map original trajectories into a low-dimensional space via dimensionality reduction, representing them as fixed-length vectors while preserving their original similarity relationships. Clustering algorithms are then applied to these feature vectors to achieve trajectory clustering. This paper reviews and summarizes the research status of trajectory feature clustering methods in Section 2 (Literature Review).
However, existing methods still face challenges in feature extraction and representation, leading to incorrect cluster assignments. Unlike urban road network-constrained traffic trajectories, ship trajectories exhibit significant spatiotemporal span variations. The salient features of different ship types and activities may manifest at different spatiotemporal scales (e.g., for cargo ships, macro-scale features are often more critical, whereas for fishing ships, micro-scale maneuvering characteristics require closer attention). Additionally, trajectories of different categories may intersect and overlap in spatiotemporal position, with complex coupling relationships between location and motion features [12,13,14]. Existing methods struggle to establish comprehensive and accurate trajectory feature embedded representations simultaneously across multiple scales and dimensions, and the embedded representations generated by a model do not necessarily fit the clustering task, resulting in low clustering accuracy and affecting the judgment of ship behavior patterns or identity types. Furthermore, most deep clustering methods adopt an end-to-end decision-making mode [15], and little research has theoretically established the connection between the distance of trajectory feature representations in the embedded space and the original trajectory similarity; instead, such methods rely heavily on extensive experimental testing.
In response to the aforementioned problems, this paper proposes a deep clustering model adapted for cross-spatiotemporal scale ship trajectories: ERL-DTC. It employs a multi-scale feature aggregation network and a Dual-feature Self-attention Fusion Encoder (DualSFE) to accomplish deep embedding representation of multi-scale and multi-dimensional trajectory features, and subsequently uses the k-means clustering algorithm to cluster the feature vectors. During model training, the joint optimization of contrastive loss and clustering loss is implemented to achieve synergistic adaptation between feature embedding and the clustering task while preserving trajectory similarity. The main contributions of this paper are as follows:
(1)
This paper proposes a deep embedded representation learning method for ship trajectory features tailored to the clustering task. Centered around the Temporal Attention-based Multi-scale feature Aggregation Network (TA-MAN), it dynamically adjusts the focus via learnable query vectors and retains key information at different scales through a layer-by-layer aggregation process, generating low-dimensional dense feature vectors. This addresses the difficulty existing methods have in simultaneously considering both local trajectory features and long-term temporal patterns. To tackle the problem of interference and ineffective interaction among different dimensional features of trajectories at the micro-scale, this study proposes decoupling trajectory motion and position features and utilizes a multi-head self-attention mechanism to construct coupling relationships among different dimensions, thereby forming a more comprehensive trajectory feature embedded representation.
(2)
This paper introduces the concept of contrastive learning to establish consistency between the similarity relationships of original features and embedded representations by reducing contrastive loss. A two-stage training strategy of “pre-training and joint training” is adopted, and a joint loss function comprising contrastive loss and clustering loss is designed to optimize model parameters and clustering centers. This simultaneously constrains trajectory feature similarity learning and the k-means clustering process, ensuring that the learned trajectory embeddings align with original features and guide the aggregation of similar samples. Clustering loss is used to sharpen inter-class boundaries, resulting in an embedded representation more conducive to clustering tasks and improving clustering performance.
(3)
The proposed method is evaluated using open-source trajectory datasets. Results demonstrate that the proposed method can achieve clustering with high accuracy and low computing power cost, and its effect is significantly better than the existing methods. Ablation studies and parameter sensitivity analyses confirm the necessity and effectiveness of the model and training design.
The remainder of this paper is organized as follows: Section 2 reviews related work on trajectory clustering; Section 3 formulates the trajectory feature clustering problem; Section 4 describes the proposed ERL-DTC method; Section 5 presents experimental validation and analysis; and Section 6 concludes the paper and outlines future research directions.

2. Literature Review

Traditional trajectory clustering relies on similarity measurement methods based on local point matching, such as the Longest Common Subsequence (LCSS), Edit Distance on Real Sequence (EDR), Dynamic Time Warping (DTW), etc. [16]. These methods identify optimal alignments through dynamic programming and compute similarity between trajectories point by point, completing clustering using algorithms such as k-means and DBSCAN [17]. Zhao L et al. [18] proposed a ship trajectory clustering method based on the Douglas-Peucker compression algorithm and an improved DBSCAN. By calculating a DTW distance matrix and optimizing parameters adaptively, they significantly enhanced the clustering performance for large-scale, complex-distributed trajectories. However, these studies primarily focused on the positional features of trajectories, thus often being used for extracting customary routes. As research deepened, some scholars attempted to incorporate multidimensional features such as time, speed, and heading into similarity metrics to improve the characterization of ship motion patterns and behavioral semantics. Zhen R et al. [19] designed a similarity metric for ship behavior based on trajectory position and direction features. In the study of Zhou Y et al. [20], the trajectory path and ground speed were considered, and Euclidean distance was used to define ship behavior similarity. Based on an improved k-means algorithm, they clustered ship behaviors within port areas, with results reflecting both positional and speed variation patterns. To improve computational efficiency, Li H H et al. [8] utilized an adaptive Douglas-Peucker compression algorithm to simplify ship trajectory information, used DTW to measure trajectory similarity, and finally applied spectral clustering to extract ship traffic behavior characteristics and mine motion patterns. However, traditional trajectory clustering methods struggle to capture deep patterns in trajectories and are susceptible to noise interference, leading to suboptimal results. Tedjopurnomo et al. [16] extended four traditional trajectory similarity measurement algorithms (DTW, EDR, ERP, LCSS) to spatiotemporal dimensions using a balancing factor and conducted experimental analyses on public datasets. Compared with deep learning representation methods, traditional approaches revealed inherent limitations in dynamically balancing spatiotemporal features and scalability. Research by Chang Y et al. [21] indicated that the DTW algorithm is sensitive to minor disturbances in local trajectory points, and error accumulation can easily lead to distortion in global similarity measurement. Furthermore, the point-matching similarity calculation process requires at least quadratic time complexity relative to the number of trajectory points. Such high computational cost often makes it difficult to scale directly to large samples. For example, computing the Fréchet distance for 10,000 pairs of GPS vehicle trajectories on a high-performance server can take over 10 h [22].
To overcome the limitations of traditional clustering methods, some scholars have proposed deep trajectory clustering methods, which train neural networks via gradient descent under unsupervised conditions to extract latent discriminative features of trajectories. T2vec [23] achieved a deep feature embedded representation by maximizing the conditional probability from low-sampling to high-sampling trajectories using an LSTM-based autoencoder. The trained model encodes a trajectory into a dense vector, and the similarity between two trajectories can be approximated by a distance between their corresponding vectors, thereby reducing the original quadratic time complexity to linear complexity and significantly improving computational efficiency. Traj2vec [24] drew inspiration from Word2Vec technology in NLP applications [25,26], mapping motion information in ship trajectories into a low-dimensional space via a Seq2seq autoencoder and then clustering the embedded representations with the k-means algorithm, achieving performance advantages unattainable by traditional clustering methods. However, the model input was a sequence of statistical features extracted via a sliding window, resulting in the loss of some micro-motion features. Chen Ziwen et al. [27] proposed a representation learning-based spatiotemporal trajectory similarity computation model (RSTS). Through an improved RNNs-AE and a spatiotemporal-aware loss function design, they quantified the spatiotemporal proximity relationships of ship grid cells, forcing the decoder to assign higher probabilities to spatiotemporally adjacent units, thereby learning representations that reflect real trajectories. Nevertheless, the above methods did not explicitly constrain the separability of features in the clustering space, and network optimization was conducted independently of the clustering process, resulting in embedded representations that might not align well with clustering tasks and could even lead to degenerate solutions. To address this issue, recent research has proposed joint optimization frameworks that collaboratively adapt feature learning and clustering objectives. Deep neural networks (DNNs) can be regarded as “regularization terms” for clustering models, constraining latent features and reducing the risk of degenerate solutions. For example, the deep clustering algorithm DTC [15] used a variational autoencoder (VAE) to generate latent variables and jointly optimized feature representation and cluster assignment in the latent space, achieving a notable silhouette coefficient improvement on taxi trajectory datasets. To address the issue of uneven spatial distribution of trajectory points, Wang Chao et al. [28] built on this to propose the DSTC framework, achieving spatiotemporal trajectory representation learning and end-to-end clustering in a clustering-friendly space. However, Fang et al. [29] found in their study on deep trajectory clustering that even when the set number of clusters differed from the true value, the model could still output trajectory embedding results with high normalized mutual information (NMI), indicating that reconstruction loss might fail to maintain consistency between the similarity relationships of original samples and latent representations. Most existing studies assume that features extracted by the network can reflect trajectory similarity and rely on extensive experimental evaluation to prove the effectiveness of clustering results [30].
Few works have established effective constraints on the relationship between embedded representations’ similarity and the similarity of the original trajectory features. Additionally, LSTMs are affected by the vanishing gradient problem, making it difficult to effectively represent long-term temporal patterns and multidimensional feature coupling relationships when embedding long-duration trajectories. They are not suitable for ship trajectory clustering across spatiotemporal scales.

3. Modeling of Trajectory Clustering Problem

This section will introduce relevant definitions, such as trajectory similarity and feature embedded representation, and model the trajectory clustering problem.

3.1. Definition

Definition 1.
Ship Trajectory: A ship trajectory is defined as a time series of multiple trajectory points. A trajectory of length L can be expressed as $T_i = \{p_1, p_2, \ldots, p_L\}$. For non-cooperative ship trajectories, each trajectory point typically includes only timestamp (Time), longitude (lon), latitude (lat), speed over ground (SOG), and course over ground (COG) [31]. The original information of the t-th trajectory point can be represented by a quintuple: $p_t = (Time_t, lon_t, lat_t, SOG_t, COG_t)$. The trajectory set is denoted as $\chi = \{T_1, T_2, \ldots, T_N\}$, where N is the number of trajectories.
Definition 2.
Trajectory Similarity: Trajectory similarity Sim(Ti,Tj) is a measure of the consistency of spatiotemporal motion patterns between two trajectories Ti and Tj, satisfying the following conditions:
(1)
Symmetry: Sim(Ti,Tj) = Sim(Tj,Ti);
(2)
Spatiotemporal consistency: If Ti and Tj have consistent spatiotemporal patterns, then Sim(Ti,Tj)→1; otherwise, Sim(Ti,Tj)→0;
(3)
Robustness: For a noise disturbance δ, if $\|T_i - \tilde{T}_i\| \le \delta$, then $|\mathrm{Sim}(T_i, T_j) - \mathrm{Sim}(\tilde{T}_i, T_j)| \le \varepsilon$, where $\tilde{T}_i$ denotes the trajectory $T_i$ after adding the disturbance, and ε is the tolerance threshold.
Additionally, trajectory similarity must comprehensively consider spatiotemporal position, motion characteristics, and other factors presented by the target activities [20]. In this paper, trajectories with small spatiotemporal distances, similar motion states, and consistent temporal dependencies and spatiotemporal-motion coupling relationships are defined as similar trajectories.
Definition 3.
Trajectory Feature Embedded Representation: The process of converting original trajectory features into fixed-length vector representations is referred to as trajectory feature embedded representation [24]. Given a parameterized mapping function $f_\phi: \mathbb{R}^{L \times 5} \to \mathbb{R}^d$, the original trajectory $T_i$ is transformed into a d-dimensional dense vector $e_i = f_\phi(T_i)$, such that the distance $\mathrm{dist}(e_i, e_j)$ in the embedding space reflects the similarity of spatiotemporal patterns of the original trajectories. Typically, the larger $\mathrm{dist}(e_i, e_j)$ is, the smaller $\mathrm{Sim}(T_i, T_j)$ is, and the lower the similarity of spatiotemporal patterns between $T_i$ and $T_j$. Since trajectory similarity is a measure of consistency in multi-dimensional information such as spatiotemporal and motion states, the embedded representation used for similarity analysis should contain latent information about time, position, motion state, and their interactions. In other words, the generated trajectory representation should simultaneously include both intrinsic trajectory features and pairwise relationships between trajectories.

3.2. Problem Modeling

The goal of ship trajectory clustering is to partition χ into K distinct clusters based on similarity, with each cluster representing a category of trajectory sets. The clustering process can be described as:
$$\mathrm{Cluster}(\chi) = \mathrm{Cluster}(\{T_1, T_2, \ldots, T_N\}) = \{D_1, D_2, \ldots, D_K\} \tag{1}$$
where Dk denotes a set of trajectories belonging to the same category (similar trajectories). Taking the k-means algorithm as an example, its clustering objective function is:
$$\arg\max \sum_{k=1}^{K} \sum_{T_i \in D_k} \mathrm{Sim}(T_i, T_k) \tag{2}$$
where Tk is the center trajectory of that cluster. The trajectory clustering process includes feature extraction and representation, similarity measurement, and cluster center computation, as shown in Figure 1. First, effective features are extracted from the trajectories to be clustered, and their similarity is calculated based on specific rules. Then, cluster centers are initialized according to the similarity measurement results and iteratively updated until convergence is achieved, resulting in sets of trajectories with similar patterns.
The point-matching computation process in traditional clustering methods leads to quadratic computational complexity O(N2), and feature extraction often relies on manual design, imposing many limitations. Therefore, recent studies have focused on using neural networks to extract trajectory features and achieve clustering based on similarity of embedded representation, i.e., deep trajectory clustering. In the trajectory feature embedding space, the clustering process and objective function are transformed into:
$$\mathrm{Cluster}(\chi) = \mathrm{Cluster}(\{f_\phi(T_1), f_\phi(T_2), \ldots, f_\phi(T_N)\}) \tag{3}$$
$$\arg\min \sum_{k=1}^{K} \sum_{f_\phi(T_i) \in S_k} \mathrm{dist}(f_\phi(T_i), \mu_k) \tag{4}$$
where $f_\phi(T_i)$ is the representation of trajectory $T_i$ in the embedding space, $S_k$ is the k-th cluster in the embedding space, and $\mu_k$ is the cluster center. According to ship motion analysis, trajectories are actually controlled by a small number of latent variables (e.g., thrust, resistance, turning angular velocity). Thus, ship trajectories satisfy the manifold assumption, meaning that high-dimensional observed data actually lie on a low-dimensional smooth manifold. Here, $f_\phi(T_i)$ can be regarded as the coordinates of $T_i$ on this potential smooth manifold, reflecting similarity in motion patterns and customary routes. Let $e_i = f_\phi(T_i)$; this paper uses cosine similarity to measure the similarity between vectors in the embedding space:
$$\mathrm{dist}(e_i, e_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|} \tag{5}$$
We need to embed the information of two trajectories Ti and Tj into low-dimensional vectors that reflect their similarity relationships, ensuring that the clustering division results of the embedded vectors align with the trajectory clustering results, i.e., simultaneously satisfying Equations (4) and (6):
$$\arg\min_{W} \left| \mathrm{dist}(e_i, e_j) - \mathrm{Sim}(T_i, T_j) \right| \tag{6}$$
where W represents the network model parameters. The process of training the model is essentially fitting the nonlinear transformation $f_\phi(\cdot)$. Using the clustering results of feature embedded representations to represent trajectory categories significantly improves computational efficiency and reduces the sensitivity of clustering results to noise and sampling rates in original trajectories. Section 2 (Literature Review) has reviewed the latest research in this field. It is evident that most existing models are designed for urban traffic trajectory clustering tasks under road network constraints. In contrast, ship trajectories often exhibit large spatiotemporal span variations and complex multi-dimensional feature interactions. Below, we analyze this through a specific dataset.

3.3. Basic Characteristics of Ship Trajectory Data

Taking the trajectory data of the “XX Trajectory Intelligent Classification and Identification” task from the “Golden Dolphin Cup” Algorithm Challenge as an example, we analyze the characteristics of ship trajectories. The data are organized and prepared from global AIS data and include multiple types of ships [32]. We selected 1589 complete trajectories from four categories with sufficient samples. All trajectory points are arranged in chronological order, and the trajectory information includes timestamp, longitude, latitude, SOG, and COG. First, the raw trajectories of the four vessel types were preprocessed, including timestamp conversion, trajectory cleaning, and interpolation. The minimum number of trajectory points was set to 30, resulting in the removal of 536 abnormal or too-short trajectories. The interpolation threshold was set to 600 s. The processed vessel trajectory data (containing a total of 1053 complete trajectories from the four vessel types) served as the dataset for analysis and experiments in this study. The positional distribution of the trajectory data is shown in Figure 2.
Figure 2 illustrates the geographical distribution of trajectories for the four vessel types. Different colors in the figure represent trajectories of different vessel categories. It can be intuitively observed that Category D vessel trajectories have a larger spatiotemporal span, covering a wider maritime area. In contrast, the activities of Categories A, B, and C exhibit certain regional characteristics with relatively concentrated operational ranges: Category A vessels can navigate inland waterways; Category C vessels operate mainly along European coasts; while Category B vessels primarily operate along the coasts of China and surrounding countries. Regarding motion characteristics, the trajectories of the four vessel types also demonstrate a degree of separability. The statistical probability distributions of their Speed Over Ground (SOG) and Course Over Ground (COG) are shown in Figure 3 and Figure 4, respectively.
Figure 3 presents the statistical distribution of Speed Over Ground (SOG) for the four vessel types. The horizontal axis represents speed (kn), and the vertical axis represents probability, indicating the proportion of trajectory segments within that speed interval relative to the total number of segments. It can be seen that the SOG of Category A and C vessels is mostly maintained below 20 kn, suggesting they are likely predominantly coastal low-speed vessels or small inland waterway vessels. However, Category A trajectories contain a significant proportion of stationary (berthed) segments, while the SOG probability peak for Category C vessels occurs around 13 knots. On the other hand, Category B and D vessels exhibit higher speeds (>20 knots) in some segments, likely indicating they are primarily large cargo ships used for ocean-going transport.
Figure 4 displays the statistical distribution of Course Over Ground (COG) for the four vessel types in the form of polar sector plots. The angle (0°~360°) represents the course, and the radial length indicates the proportion of trajectory segments within that course interval. As can be seen from the figure, the course distributions for Category A and D vessels are relatively scattered and change frequently, indicating more maneuvering during navigation. In contrast, the courses for Category B and C vessels are relatively concentrated, suggesting more stable course-keeping.
Unlike urban spatiotemporal trajectories constrained by road networks, ship trajectories typically have wide spatiotemporal distributions, large variations in spatiotemporal spans, and overlapping spatiotemporal trajectories among different ship categories. There are strong coupling relationships between information of different dimensions (spatiotemporal position, motion features) and different time scales (micro, macro). Ignoring the modeling of motion features and temporal dependencies will inevitably lead to partial and distorted clustering results. This is also the main reason why many models that perform well on urban traffic trajectory datasets cannot adapt to ship trajectory clustering tasks. Therefore, there is an urgent need to develop a feature-embedded representation learning method that aligns with the requirements of ship trajectory clustering.

4. Methodology

To address the cross-spatiotemporal-scale ship trajectory clustering task, we proposed a deep ship trajectory clustering method based on feature embedded representation learning, referred to as ERL-DTC. Below, an overview of the method is provided first, followed by detailed descriptions of each component in the framework, and finally, the model training process is introduced.

4.1. Method Overview

Based on the analysis above, ERL-DTC needs to achieve multi-scale feature characterization, multi-dimensional feature fusion, and cluster-friendly embedding for vessel trajectories. Accordingly, following a progressive design philosophy, the model structure is devised as illustrated in Figure 5, encompassing trajectory augmentation, a temporal positional encoding module, TA-MAN, DualSFE, and feature clustering. Specifically, TA-MAN aims to address the challenges of capturing long-term dependencies and micro-scale maneuvering features in trajectories, while DualSFE is designed to model the complex, non-linear coupling relationships between positional and motion features, thereby avoiding feature confusion. By introducing a joint training strategy that combines contrastive learning with multi-objective clustering losses, the embedding space is ensured to both preserve original similarities and possess high cluster separability.
ERL-DTC employs the concept of contrastive learning, generating augmented trajectories via point distortion and down-sampling. It segments trajectories using temporal sliding windows. These trajectory segments are fed into a BiLSTM to model micro-scale temporal features while encoding temporal positions and intervals. Subsequently, a multi-layer temporal attention aggregation mechanism is leveraged to achieve feature aggregation from micro to macro scales. Through DualSFE, adaptive fusion of positional and motion features is accomplished. The output of this stage, combined with statistical features of the entire trajectory, is processed by a gated fusion module to obtain the final embedded representation of the complete trajectory. The model adopts a combined strategy of pre-training and joint training, constraining the feature space distribution jointly with contrastive loss and clustering loss. Finally, the k-means clustering algorithm is applied to categorize the trajectories.

4.2. Trajectory Augmentation

To simulate noise interference and data missing in real-world scenarios, this paper employs two strategies to generate corresponding ship trajectories: point distortion and down-sampling. On one hand, this enriches the trajectory dataset, significantly enhancing the model’s generalization capability across diverse complex trajectory scenarios. On the other hand, based on the concept of contrastive learning, trajectory augmentation creates varied trajectory variants for the encoder, encouraging it to capture both common and distinct features among these variants during subsequent training, thereby strengthening the robustness and discriminative power of trajectory feature embedded representations.
Point Distortion: For a given ship trajectory $T_i$, point distortion involves applying random offsets to each trajectory point, aiming to learn similar trajectories with minor numerical feature variations. The trajectory after point distortion, denoted as $T_i'$, and its trajectory points $p_t'$ can be expressed as:
$$T_i' = \{p_1', p_2', \ldots, p_L'\} \tag{7}$$
$$p_t' = (Time_t, lon_t + \Delta lon, lat_t + \Delta lat, SOG_t + \Delta SOG, COG_t + \Delta COG) \tag{8}$$
Here, $\Delta lon \sim N(0, \sigma_{lon}^2)$, $\Delta lat \sim N(0, \sigma_{lat}^2)$, $\Delta SOG \sim N(0, \sigma_{SOG}^2)$, and $\Delta COG \sim N(0, \sigma_{COG}^2)$ are random noises following specific Gaussian distributions in the four original feature dimensions. The standard deviation σ is determined based on the fluctuation range of features and measurement accuracy. This paper sets $\sigma_{lon} = \sigma_{lat} = 0.0005°$, $\sigma_{SOG} = 0.5\,\mathrm{kn}$, and $\sigma_{COG} = 2.0°$.
Down-sampling: Down-sampling generates incomplete trajectory sequences through a point masking mechanism. Specifically, for each trajectory point of a given original trajectory Ti, a random deletion operation is performed with probability P, where the probability for each point satisfies an independent and identically distributed uniform distribution (This article sets P = 0.25). Furthermore, considering the importance of temporal continuity of trajectory points for feature expression, a sliding window constraint mechanism is adopted during point deletion: if the number of continuously deleted points within a window exceeds a threshold, the deletion operation is re-executed to ensure that the trajectory does not exhibit excessively long fragmented segments. This mechanism enhances data diversity while avoiding destruction of the core dynamic characteristics of the trajectory, providing more challenging training samples for the model and encouraging the encoder to learn the global structure and local feature correlations of sparse trajectories.
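For illustration, the following minimal NumPy sketch implements the two augmentation strategies. The noise scales follow the σ values given above and P = 0.25; the window threshold max_gap, the retry-based handling of the sliding-window constraint, and the toy trajectory are assumptions not specified in this paper.

```python
import numpy as np

# Noise scales from this section: sigma_lon = sigma_lat = 0.0005 deg,
# sigma_SOG = 0.5 kn, sigma_COG = 2.0 deg; deletion probability P = 0.25.
SIGMAS = np.array([0.0005, 0.0005, 0.5, 2.0])   # lon, lat, SOG, COG

def point_distortion(traj: np.ndarray, rng) -> np.ndarray:
    """Equations (7)-(8): Gaussian offsets on the four feature columns,
    leaving timestamps intact. traj: (L, 5) = [time, lon, lat, SOG, COG]."""
    noisy = traj.copy()
    noisy[:, 1:] += rng.normal(0.0, SIGMAS, size=(len(traj), 4))
    return noisy

def down_sample(traj: np.ndarray, rng, p=0.25, max_gap=3) -> np.ndarray:
    """Drop points i.i.d. with probability p, re-drawing the mask whenever
    more than max_gap consecutive points would be deleted (the sliding-window
    constraint; max_gap is an assumed threshold)."""
    while True:
        keep = rng.random(len(traj)) > p
        run, longest = 0, 0
        for k in keep:
            run = 0 if k else run + 1
            longest = max(longest, run)
        if longest <= max_gap:
            return traj[keep]

rng = np.random.default_rng(0)
traj = np.cumsum(rng.random((40, 5)), axis=0)    # toy trajectory
distorted, sparse = point_distortion(traj, rng), down_sample(traj, rng)
```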

4.3. Temporal Positional Encoding

The temporal position and time intervals of ship trajectory points are crucial for understanding the dynamic changes in trajectories. Conventional positional encoding alone often neglects time-interval information, resulting in the loss of substantial useful information for non-uniformly sampled trajectory points. This study adopts a combined strategy of absolute and relative temporal position encoding, aiming to capture temporal order information while addressing the issue of uneven time intervals between trajectory points.
Inspired by positional embedding in Transformer [33] and BERT [34], this paper adopts sine and cosine functions for absolute positional encoding, assigning a unique temporal position identifier to each trajectory point. Given a trajectory sequence of length L, for the i-th trajectory point, its absolute positional encoding P E a b s ( i , j ) in the d-dimensional feature space is calculated as follows:
$$PE_{abs}(i, j) = \begin{cases} \sin\left(i / 10000^{j/d}\right), & j \text{ even} \\ \cos\left(i / 10000^{(j-1)/d}\right), & j \text{ odd} \end{cases} \tag{9}$$
where $j = 0, 1, \ldots, d$. Through the combination of sine and cosine functions with different frequencies, the model can learn the absolute position information of trajectory points in the time series.
To characterize the temporal interval relationships between trajectory points, this study introduces relative positional encoding. Let the time interval between trajectory point $p_i$ and the starting point be $\Delta t_i$. A linear transformation layer $\mathrm{Linear}(\cdot)$ is used to map the relative time interval $\Delta t_i$ to a relative position vector $PE_{rel}(\Delta t_i)$; that is, the relative temporal position encoding of trajectory point $p_i$ relative to the starting point is:
$$PE_{rel}(\Delta t_i) = \mathrm{Linear}(\Delta t_i) \tag{10}$$
The final temporal-positional encoding for trajectory point $p_i$ is:
$$PE(i, j) = PE_{abs}(i, j) + PE_{rel}(\Delta t_i) \tag{11}$$
This approach simultaneously encodes the sequential order and relative time interval information of trajectory points.
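The sketch below illustrates this combined encoding in PyTorch: sinusoidal absolute encoding (Equation (9)) plus a learned linear map of the elapsed time (Equations (10) and (11)). The class name, the assumption of an even d_model, and treating Δt as raw seconds are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TemporalPositionalEncoding(nn.Module):
    """Sinusoidal absolute encoding plus a learned linear map of the elapsed
    time since the first point; assumes an even d_model."""
    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.rel = nn.Linear(1, d_model)   # Linear(.) of Equation (10)

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: (L,) time intervals (e.g., seconds) from the starting point
        L, d = delta_t.size(0), self.d_model
        pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)      # (L, 1)
        div = torch.pow(10000.0, torch.arange(0, d, 2).float() / d)  # (d/2,)
        pe_abs = torch.zeros(L, d)
        pe_abs[:, 0::2] = torch.sin(pos / div)   # even feature dimensions
        pe_abs[:, 1::2] = torch.cos(pos / div)   # odd feature dimensions
        pe_rel = self.rel(delta_t.unsqueeze(1))  # (L, d) relative encoding
        return pe_abs + pe_rel                   # Equation (11)

pe = TemporalPositionalEncoding(64)
enc = pe(torch.tensor([0.0, 30.0, 90.0, 600.0]))   # (4, 64)
```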

4.4. Temporal Attention-Based Multi-Scale Feature Aggregation Network

Ship activity patterns exhibit distinct characteristics across different temporal scales. For instance, frequent course changes (corresponding to the micro-scale) may characterize fishing ship operations, while stable long-distance routes (macro-scale) may represent cargo ship transportation. Traditional feature embedding methods struggle to simultaneously focus on these key pieces of information across scales. Therefore, this study adopts a strategy of layer-by-layer abstraction and hierarchical aggregation to design a Temporal Attention-based Multi-scale feature Aggregation Network (TA-MAN). It utilizes a BiLSTM encoder to extract micro-scale features, followed by temporal attention to achieve multi-scale feature embedded representation.

4.4.1. BiLSTM-AE

First, a complete trajectory is sliced into a sequence of trajectory segments using a temporal sliding window with a specific step size. Subsequently, a BiLSTM encoder is employed to encode the features of each trajectory segment to characterize their temporal dependencies at the micro-scale.
LSTM is a variant proposed to solve the gradient vanishing or explosion problems in RNNs when learning long-term dependencies [35]. It uses an input gate to determine the retention degree of current input information, a forget gate to control the retention or discarding of historical information in the cell state, and an output gate to determine the final output based on the cell state and current input. For specific implementation details, refer to [36]. BiLSTM consists of a bidirectional LSTM structure where the input sequence is fed in both forward and reverse orders. The output is influenced by inputs and hidden states from both past and future time steps. This design aims to model the temporal dependencies of trajectory features bidirectionally. Its principle is illustrated in Figure 6.
Based on the BiLSTM network, the BiLSTM encoder is designed as two independent LSTM structures that map the input sequence into a forward hidden sequence $\overrightarrow{h}_t$ and a backward hidden sequence $\overleftarrow{h}_t$, respectively:
$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}; \overrightarrow{W}) \tag{12}$$
$$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1}; \overleftarrow{W}) \tag{13}$$
Here, $\overrightarrow{W}$ and $\overleftarrow{W}$ are the weight matrices of the forward and backward layers, respectively, and $x_t$ is the feature sequence of the trajectory segment (including positional and motion features). The forward and backward hidden sequences output by the encoder are fed into the same output layer and concatenated into a compact bidirectional representation $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. Using parameter-sharing BiLSTM encoders, the sequence of trajectory segments obtained by slicing the entire trajectory is encoded to produce a micro-level trajectory feature embedded representation sequence $H_{micro} = \{h_1, h_2, \ldots, h_m\}$, where m is the number of sliding windows.
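A minimal PyTorch sketch of the parameter-shared segment encoder follows. The input and hidden sizes, and the use of the final forward/backward hidden states as the segment summary, are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Parameter-shared BiLSTM that maps each trajectory segment to a single
    bidirectional vector h = [h_fwd, h_bwd] (Equations (12)-(13))."""
    def __init__(self, in_dim: int = 5, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (m, w, in_dim) -- m sliding windows of w points each
        _, (h_n, _) = self.lstm(segments)            # h_n: (2, m, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # H_micro: (m, 2*hidden)

enc = SegmentEncoder()
H_micro = enc(torch.randn(12, 10, 5))   # 12 windows of 10 points -> (12, 128)
```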

4.4.2. Temporal Attention-Based Multi-Scale Feature Aggregation Mechanism

The essence of multi-scale feature aggregation is to construct a hierarchical feature extraction process that captures key trajectory information ranging from local maneuvers to global behavioral patterns. In this study, this process is realized through a two-level (local and global) temporal attention design, which employs dynamic weight assignment to perform weighted summation on feature sequences from different hierarchical levels. The principle is illustrated in Figure 7.
In this research, local temporal attention is used to aggregate feature embedded representations at the micro-level. First, temporal positional encoding information is added to the output $h_t$ of the BiLSTM to obtain $\tilde{h}_t$:
$$\tilde{h}_t = h_t + PE(t) \tag{14}$$
A local attention mechanism is introduced. By computing local attention weights $\alpha_t$ through the interaction between a learnable query vector $q_l^T$ and the features, the model achieves an adaptive assessment of the importance of short-term temporal segments. The vector $g_k$ obtained after weighted summation can be regarded as a “summary” of a contiguous block of micro-scale features, representing the meso-scale semantic characteristics of the trajectory. Given the input trajectory feature sequence $\tilde{H}_{micro} = \{\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_m\}$, it is divided into non-overlapping groups of three (zero-padding is applied for any remaining sequences with insufficient length), and the intra-group attention weights $\alpha_t$ for each group are calculated:
$$\alpha_t = \mathrm{softmax}\left(q_l^T \tilde{h}_t / \sqrt{d}\right), \quad t \in \{3k-2, 3k-1, 3k\} \tag{15}$$
Here, t is the time step, $k = 1, 2, \ldots, n$, $q_l^T$ is a learnable local query vector, and d is the feature dimension. The local summary vector $g_k$ is generated as:
$$g_k = \sum_{t} \alpha_t \tilde{h}_t, \quad t \in \{3k-2, 3k-1, 3k\} \tag{16}$$
Thus, the meso-level feature sequence $G = \{g_1, g_2, \ldots, g_n\}$ of the trajectory is obtained.
Similarly, taking the sequence of these summaries G as input, global attention weights $\beta_k$ are calculated via a global query vector $q_g^T$ and a learnable linear transformation matrix $W_g$ (as shown in Equation (17)). This step further filters out the most decisive information for the long-term behavioral patterns of the entire trajectory (such as habitual routes and activity areas) and performs a secondary aggregation (as shown in Equation (18)), forming a macro-level feature embedded representation $R = \{r_1, r_2, \ldots, r_s\}$ that reflects the long-term temporal dependencies of the trajectory.
$$\beta_k = \mathrm{softmax}\left(q_g^T W_g g_k / \sqrt{d}\right), \quad k \in \{3h-2, 3h-1, 3h\} \tag{17}$$
Here, $h = 1, 2, \ldots, s$. The weighted aggregation yields the global representation $r_h$:
$$r_h = \sum_{k} \beta_k g_k, \quad k \in \{3h-2, 3h-1, 3h\} \tag{18}$$
The design of TA-MAN adopts a progressive aggregation structure from micro to macro scales. This effectively mitigates the memory decay problem inherent in traditional RNN models when processing long sequences. Furthermore, it provides an interpretability perspective for model decisions through the attention weights.
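The following PyTorch sketch illustrates one level of this aggregation: each element of a group of three is scored against a learnable query, and one summary vector is emitted per group (Equations (15) and (16)). Stacking two instances yields the micro-to-macro hierarchy; the learnable transform $W_g$ of Equation (17) is omitted for brevity, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnPool3(nn.Module):
    """One aggregation level of TA-MAN: split the sequence into groups of
    three (zero-padded), weight each element against a learnable query, and
    emit one summary per group."""
    def __init__(self, d: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(d))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        m, d = h.shape
        pad = (-m) % 3
        if pad:
            h = torch.cat([h, h.new_zeros(pad, d)])       # zero-pad last group
        blocks = h.view(-1, 3, d)                             # (n, 3, d)
        w = torch.softmax(blocks @ self.q / d ** 0.5, dim=1)  # alpha_t per group
        return (w.unsqueeze(-1) * blocks).sum(dim=1)          # (n, d) summaries

# Stacking two levels realizes the micro -> meso -> macro hierarchy:
local_pool, global_pool = AttnPool3(128), AttnPool3(128)
H_micro = torch.randn(27, 128)                # 27 windowed segment embeddings
R_macro = global_pool(local_pool(H_micro))    # (3, 128) macro-level sequence
```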

4.5. Dual-Feature Self-Attention Fusion Encoder

The category of trajectories is closely related to multidimensional factors such as spatiotemporal position and motion characteristics. However, previous studies simply concatenated these features and fed them into the network, increasing the input dimension and processing difficulty. Additionally, dimensional differences and noise can cause feature conflicts, affecting the transmission of useful information. To address this issue, we design a Dual-Feature Self-Attention Fusion Encoder (DualSFE). On the one hand, in the feature input layer, the trajectory position and motion features are decoupled, and the intrinsic patterns of the two features are modeled separately through TA-MAN to avoid mutual interference, generating independent embedded representation sequences $R_p, R_m \in \mathbb{R}^{s \times d}$, where $R_p$ is the position feature sequence and $R_m$ is the motion feature sequence. On the other hand, by modeling the dynamic correlation between position and motion features through the dual-branch self-attention mechanism, a more comprehensive trajectory embedded representation is generated, thereby avoiding information confusion caused by rigid concatenation. Referring to the overall structure of the Transformer encoder, the DualSFE is designed as shown in Figure 8.
The core of the encoder lies in the design of the Dual-Feature Self-Attention Module (DFSAM). As shown in Equations (19) and (20), the position and motion features are first transformed into their respective Query ( Q ), Key ( K ), and Value ( V ) matrices through linear projection, and independent representation subspaces are constructed for the two types of features while ensuring that their unique information is not confused.
$$Q_p = R_p W_p^Q, \quad K_p = R_p W_p^K, \quad V_p = R_p W_p^V \tag{19}$$
$$Q_m = R_m W_m^Q, \quad K_m = R_m W_m^K, \quad V_m = R_m W_m^V \tag{20}$$
Here, $W^Q$, $W^K$, and $W^V$ are all learnable parameter matrices. The multi-head attention mechanism essentially maps trajectory features to different subspaces, with each head focusing on a specific aspect of the features. To enable the model to learn the spatial dependency relationship between position points and the temporal evolution relationship between motion states, the scaled dot product is used to calculate the attention weights of position features and motion features separately:
$$A_p^i = \mathrm{softmax}\left(\frac{Q_p^i (K_p^i)^T}{\sqrt{d / Num\_h}}\right) \tag{21}$$
$$A_m^i = \mathrm{softmax}\left(\frac{Q_m^i (K_m^i)^T}{\sqrt{d / Num\_h}}\right) \tag{22}$$
where $Num\_h$ is the number of heads, and $A^i$ is the attention coefficient matrix for the position (motion) features of the i-th head, representing the correlation between different points in terms of position (motion) features.
DFSAM introduces a learnable parameter γ to dynamically adjust the degree to which the final feature representation should rely on the correlation between position and motion features, which is crucial for identifying different types of ship patterns: for example, identifying ships operating in fixed areas may rely more on position patterns, while identifying specific maneuvers may rely more on motion patterns. The final fusion weight calculation is:
$$A_{pm}^i = A_p^i + \gamma A_m^i \tag{23}$$
The single-head output fused feature can then be obtained:
$$O_{pm}^i = A_{pm}^i V_p^i \tag{24}$$
The outputs of all heads are concatenated to form the complete output of DFSAM O p m :
$$O_{pm} = \left(O_{pm}^1 \,\|\, O_{pm}^2 \,\|\, \cdots \,\|\, O_{pm}^{Num\_h}\right) W^o \tag{25}$$
where $\|$ denotes concatenation, and $W^o \in \mathbb{R}^{d \times d}$ is a learnable weight matrix.
In terms of the encoder structure, DualSFE adopts a multi-layer stacked design. Each layer comprises DFSAM, residual connection (Dropout), layer normalization (LayerNorm), and a feed-forward neural network (FFN). First, the input feature sequence undergoes dual feature fusion via DFSAM, followed by Dropout and LayerNorm to stabilize the training process and improve the generalization ability of the model:
$$\hat{O}_{pm} = \mathrm{LayerNorm}\left(\mathrm{Dropout}(O_{pm}) + R_p\right) \tag{26}$$
Then, higher-order features are extracted via FFN and normalized:
$$H_{pm} = \mathrm{LayerNorm}\left(\mathrm{Dropout}(\mathrm{FFN}(\hat{O}_{pm})) + \hat{O}_{pm}\right) \tag{27}$$
The FFN consists of a two-layer MLP with ReLU activation. Finally, average pooling (AvgPool) is applied to generate the comprehensive embedded representation $\hat{H}_{pm} \in \mathbb{R}^d$ of the trajectory features.
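To make the dual-feature attention concrete, the following single-head PyTorch sketch implements Equations (19)-(24) for unbatched inputs. The initial value of γ and the fused projection layout are assumptions; the multi-head version would split d across Num_h heads as in Equations (21) and (22).

```python
import torch
import torch.nn as nn

class DFSAM(nn.Module):
    """Single-head sketch of the dual-feature self-attention module:
    independent Q/K/V projections per feature stream, fused attention map
    A_pm = A_p + gamma * A_m applied to the position values; inputs are
    unbatched (s, d) sequences."""
    def __init__(self, d: int):
        super().__init__()
        self.proj_p = nn.Linear(d, 3 * d)              # -> Q_p, K_p, V_p
        self.proj_m = nn.Linear(d, 3 * d)              # -> Q_m, K_m, V_m
        self.gamma = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight

    def forward(self, r_p: torch.Tensor, r_m: torch.Tensor) -> torch.Tensor:
        q_p, k_p, v_p = self.proj_p(r_p).chunk(3, dim=-1)
        q_m, k_m, _ = self.proj_m(r_m).chunk(3, dim=-1)
        scale = q_p.size(-1) ** 0.5
        a_p = torch.softmax(q_p @ k_p.T / scale, dim=-1)   # position attention
        a_m = torch.softmax(q_m @ k_m.T / scale, dim=-1)   # motion attention
        return (a_p + self.gamma * a_m) @ v_p              # O_pm

dfsam = DFSAM(64)
out = dfsam(torch.randn(8, 64), torch.randn(8, 64))   # (8, 64)
```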
To further enhance the completeness of the embedded representations, we combine the automatically extracted features of the encoder with manually designed statistical features through a gating mechanism, including latitude/longitude span, mean and standard deviation of speed and course, total displacement, number of trajectory points, total cumulative time, etc. The statistical features are standardized and then mapped to the same dimensional space as $\hat{H}_{pm}$ via a fully connected layer, yielding $H_{st} \in \mathbb{R}^d$. Then, $H_{st}$ and $\hat{H}_{pm}$ are concatenated:
$$z = \left(\hat{H}_{pm} \,\|\, H_{st}\right) \in \mathbb{R}^{2d} \tag{28}$$
The gating mechanism is used to achieve adaptive feature fusion:
$$f = \sigma(W_f z + b_f) \in \mathbb{R}^{d} \tag{29}$$
$$e = f \odot \hat{H}_{pm} + (I - f) \odot H_{st} \tag{30}$$
where f is the fusion weight, $W_f$ is the weight matrix, $b_f$ is the bias term, $\sigma(\cdot)$ is the sigmoid activation function constraining the weight to [0, 1], $\odot$ is the Hadamard product, I is an all-ones vector, and e is the final embedded representation of the trajectory features.
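A compact sketch of this gated fusion (Equations (28)-(30)) follows, assuming both inputs have already been mapped to dimension d:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of the encoder output H_pm with standardized handcrafted
    statistics H_st."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)   # W_f, b_f acting on z = [H_pm || H_st]

    def forward(self, h_pm: torch.Tensor, h_st: torch.Tensor) -> torch.Tensor:
        f = torch.sigmoid(self.gate(torch.cat([h_pm, h_st], dim=-1)))   # (29)
        return f * h_pm + (1.0 - f) * h_st   # final trajectory embedding e (30)
```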

4.6. Model Training

This paper adopts a two-stage training strategy of pre-training and joint training to optimize trajectory embedded representation learning and the clustering process, as illustrated in Figure 9. In the pre-training stage, a trajectory feature embedding space is constructed based on the contrastive loss, and the k-means algorithm is executed to initialize the cluster centers. In the joint training stage, clustering losses are introduced. The neural network parameters and cluster assignments are updated alternately. This allows the cluster centers to converge rapidly while gradually adjusting the feature space, promoting tight aggregation of similar samples and clear separation of dissimilar samples within the embedding space.

4.6.1. Pre-Training

Previous research predominantly used reconstruction loss to build the initial feature space. However, the goal of a clustering task is not data reconstruction fidelity, but rather to obtain more distinctive categorical features. The core optimization objective of contrastive loss is to learn invariant representations, meaning that samples with similar features should be encoded to maintain high similarity. This aligns perfectly with the fundamental requirements of clustering tasks. Therefore, in the pre-training stage, this paper employs a contrastive loss function. By constructing positive and negative sample pairs, it optimizes the similarity structure of the feature space. A modified InfoNCE loss function [37] is used as the optimization objective to maximize the similarity between features of two samples from the same class and the dissimilarity between features of two samples from different classes, thereby guiding the model to learn more discriminative and structurally superior feature representations.
For the trajectory sample set $\chi = \{T_1, T_2, \ldots, T_N\}$, the corresponding augmented sample set $\chi^+ = \{T_1^+, T_2^+, \ldots, T_N^+\}$ is generated through trajectory augmentation. The augmented sample for trajectory $T_i$ is $T_i^+$. The feature vector sets output by the encoder are $E = \{e_1, e_2, \ldots, e_N\}$ and $E^+ = \{e_1^+, e_2^+, \ldots, e_N^+\}$, respectively. The contrastive loss function is defined as:
$$L_{cont} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{dist}(e_i, e_i^+)/\tau\right)}{\sum_{j=1}^{N}\left[\exp\left(\mathrm{dist}(e_i, e_j^+)/\tau\right) + \mathbb{I}_{[i \neq j]} \exp\left(\mathrm{dist}(e_i, e_j)/\tau\right)\right]} \tag{31}$$
where τ > 0 is the temperature parameter controlling the sharpness of the distribution, $\mathrm{dist}(\cdot)$ denotes cosine similarity, and $\mathbb{I}_{[i \neq j]}$ is an indicator function to exclude self-comparison.
The pre-training stage continues until L c o n t converges. As the loss function exerts a typical “push-pull” effect on the gradients of the feature vectors, with decreasing L c o n t , positive sample pairs move closer together in the feature space while negative sample pairs move apart, making the feature embedded representations of different categories of trajectories distinguishable.
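For illustration, a batch-wise PyTorch sketch of this contrastive objective follows: each anchor treats its augmented view as the positive, with all other anchors and augmented views as negatives, excluding self-comparison as in Equation (31). The default temperature value and the cross-entropy formulation are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(e: torch.Tensor, e_pos: torch.Tensor, tau: float = 0.1):
    """InfoNCE-style loss over a batch: each anchor e[i] treats e_pos[i] as
    its positive; the other augmented views and the other anchors serve as
    negatives."""
    e, e_pos = F.normalize(e, dim=1), F.normalize(e_pos, dim=1)  # cosine sims
    n = e.size(0)
    logits_pos = e @ e_pos.T / tau     # anchor vs. augmented views, (n, n)
    logits_neg = e @ e.T / tau         # anchor vs. other anchors, (n, n)
    eye = torch.eye(n, dtype=torch.bool, device=e.device)
    logits_neg = logits_neg.masked_fill(eye, float('-inf'))  # drop self pairs
    logits = torch.cat([logits_pos, logits_neg], dim=1)      # (n, 2n)
    target = torch.arange(n, device=e.device)  # positive sits at column i
    return F.cross_entropy(logits, target)

loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```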

4.6.2. Joint Training

To map the general trajectory representations obtained from pre-training into a clustering-friendly space, a multi-objective loss function is constructed:
$$L_{joint} = \lambda L_{cont} + (1 - \lambda) L_{cluster} \tag{32}$$
where $L_{cluster}$ comprises four clustering loss terms: the k-means loss $L_{km}$, the soft assignment loss $L_{soft}$, the inter-cluster distance loss $L_{inter}$, and the neighborhood consistency loss $L_{nb}$. $\lambda \in [0, 1]$ is a balancing coefficient.
First, the k-means loss is used to force samples to move closer to their respective cluster centers [38], defined as:
$$L_{km} = \sum_{i=1}^{N} \sum_{k=1}^{K} b_{ik} \left\| e_i - \mu_k \right\|^2 \tag{33}$$
where $b_{ik} \in \{0, 1\}$ is a Boolean variable assigning the embedded representation $e_i$ of trajectory $T_i$ to the k-th cluster center $\mu_k$, N is the number of samples, and K is the number of clusters. The k-means loss improves clustering quality by minimizing the distance between sample points and their assigned cluster centers.
$L_{soft}$ is introduced to strengthen the guiding effect of high-confidence samples on model training during clustering. The soft assignment loss uses Student's t-distribution to measure the similarity between samples and cluster centers [29]. The probability of trajectory $T_i$ being assigned to cluster $S_k$ is defined as:
$$q_{ik} = \frac{\left(1 + \|e_i - \mu_k\|^2 / \nu\right)^{-\frac{\nu+1}{2}}}{\sum_{k'} \left(1 + \|e_i - \mu_{k'}\|^2 / \nu\right)^{-\frac{\nu+1}{2}}} \tag{34}$$
where ν is a constant, and k′ is a dummy index in the summation (similarly for i′ below), used to avoid conflict with the k in the numerator. A nonlinear transformation is applied to $q_{ik}$ to compute the target distribution P, thereby enhancing the weight of high-confidence samples and balancing cluster sizes. The probability $p_{ik}$ in P is calculated as:
$$p_{ik} = \frac{q_{ik}^2 / \sum_{i'} q_{i'k}}{\sum_{k'} \left(q_{ik'}^2 / \sum_{i'} q_{i'k'}\right)} \tag{35}$$
By optimizing the difference between the target distribution and the predicted distribution through KL divergence (as shown in Equation (36)), high confidence samples are brought closer to the cluster center, thereby guiding the clustering process to be more robust and avoiding noisy samples dominating the update of the cluster center.
$$L_{soft} = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{k} p_{ik} \log \frac{p_{ik}}{q_{ik}} \tag{36}$$
The inter-cluster distance loss Linter is introduced to increase the differences between different clusters. In this research, we primarily consider maximizing the distance between cluster centers. Linter is designed as:
$$L_{inter} = \sum_{i} \sum_{k \neq i} \exp\left(-\left\| \mu_i - \mu_k \right\|^2\right) \tag{37}$$
The main role of the neighborhood consistency loss Lnb is to ensure that the model preserves the local manifold structure during feature mapping, meaning that a sample point should have the same label as its n nearest neighbors in the same cluster. Lnb is designed as:
$$L_{nb} = \sum_{i} \sum_{j \in X_n(i)} b_{ij} \left\| e_i - e_j \right\|^2 \tag{38}$$
where $X_n(i)$ is the set of n nearest neighbors of the i-th sample, and $b_{ij} \in \{0, 1\}$ is a Boolean variable indicating whether two samples belong to the same cluster.
Finally, $L_{cluster}$ is calculated with the weight ratio $\lambda_1 : \lambda_2 : \lambda_3 : \lambda_4 = 3 : 3 : 2 : 2$:
$$L_{cluster} = \lambda_1 L_{km} + \lambda_2 L_{soft} + \lambda_3 L_{inter} + \lambda_4 L_{nb} \tag{39}$$
where $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.
In the joint training stage, the embedded feature space is “fine-tuned” by introducing the aforementioned four complementary clustering losses. L k m and L s o f t primarily drive intra-cluster aggregation, L i n t e r drives inter-cluster separation, and L n b preserves feature local consistency. This design enables the model to jointly learn feature representations and cluster assignments, ultimately generating a clustering-friendly embedding space. The training incorporates an “early stopping” mechanism, where training is terminated if performance does not improve for 10 consecutive training epochs, preventing the model from over-optimizing on the training set.
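The sketch below assembles batch-averaged variants of the four clustering losses with the 3:3:2:2 weights (Equations (33)-(39)). Deriving hard assignments from the nearest center, ν = 1, and the neighbor count n_nb are assumptions for illustration, not the authors' exact training loop.

```python
import torch

def clustering_loss(e: torch.Tensor, mu: torch.Tensor, nu=1.0, n_nb=5):
    """Batch-averaged sketch of L_cluster; e: (N, d) embeddings, mu: (K, d)
    cluster centers, with N > n_nb."""
    d2 = torch.cdist(e, mu) ** 2                       # squared distances (N, K)
    assign = d2.argmin(dim=1)                          # hard cluster labels b_ik
    l_km = d2.gather(1, assign[:, None]).mean()        # k-means loss (33)

    q = (1 + d2 / nu) ** (-(nu + 1) / 2)               # Student's t kernel (34)
    q = q / q.sum(dim=1, keepdim=True)
    p = q ** 2 / q.sum(dim=0)                          # target distribution (35)
    p = p / p.sum(dim=1, keepdim=True)
    l_soft = (p * (p / q).log()).sum(dim=1).mean()     # KL(P || Q) (36)

    dmu2 = torch.cdist(mu, mu) ** 2
    off_diag = ~torch.eye(mu.size(0), dtype=torch.bool, device=mu.device)
    l_inter = torch.exp(-dmu2[off_diag]).mean()        # push centers apart (37)

    nbr = torch.cdist(e, e).topk(n_nb + 1, largest=False).indices[:, 1:]
    same = (assign[:, None] == assign[nbr]).float()    # b_ij, same-cluster mask
    l_nb = (same * (e[:, None] - e[nbr]).pow(2).sum(-1)).mean()   # (38)

    return 0.3 * l_km + 0.3 * l_soft + 0.2 * l_inter + 0.2 * l_nb  # (39)

e, mu = torch.randn(64, 32), torch.randn(4, 32)   # toy embeddings and centers
print(clustering_loss(e, mu).item())
```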

5. Experiments

In order to verify the effectiveness of the proposed method, this paper conducted extensive experimental analysis using the dataset introduced in Section 3.3. The following will introduce the evaluation metrics, baseline methods, and experimental results.

5.1. Evaluation Metrics

This article uses three external metrics widely used for clustering evaluation and two internal metrics to assess the performance of the method. The former require known ground-truth category labels as a reference, while the latter evaluate only the data distribution of the clustering results themselves, without relying on true labels.

5.1.1. Metrics Based on External Labels

Accuracy (ACC): Clustering results usually carry no inherent correspondence to true labels. Therefore, calculating accuracy requires an optimal alignment between predicted cluster labels and true labels. The Hungarian algorithm eliminates the permutation ambiguity between predicted and true labels through optimal assignment, providing an automated label alignment method. The calculation process is as follows.
(1)
Construct Contingency Matrix
Let the true class label set be $Y = \{y_1, y_2, \ldots, y_K\}$, and the predicted cluster label set be $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_K\}$. A $K \times K$ contingency matrix C is constructed, where each element $C_{ij} = \left|\{x_p \mid y_p = y_i \wedge \hat{y}_p = \hat{y}_j\}\right|$ represents the number of samples belonging to the true class $y_i$ and assigned to the predicted cluster $\hat{y}_j$. The matrix reflects the distributional relationship between predicted clusters and true classes.
(2)
Construct Cost Matrix and Hungarian Algorithm Optimization
The contingency matrix C is transformed into a cost matrix W, where $W_{ij} = N - C_{ij}$. The Hungarian algorithm is used to find a mapping $\sigma: Y \to \hat{Y}$ that minimizes the total cost, i.e., solving:
$$\min_{\sigma} \sum_{i=1}^{K} W_{i\sigma(i)} = \min_{\sigma} \left( KN - \sum_{i=1}^{K} C_{i\sigma(i)} \right) \tag{40}$$
This is equivalent to maximizing the number of correctly matched samples $\sum_{i=1}^{K} C_{i\sigma(i)}$. The algorithm reaches the optimal solution through iterative reduction of row/column minima and augmenting-path search. For detailed computation, refer to Kuhn [39].
(3)
Calculate Accuracy
The number of correctly classified samples under the optimal matching $\sigma(\cdot)$ is $\sum_{i=1}^{K} C_{i\sigma(i)}$. The clustering accuracy is:
$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{K} C_{i\sigma(i)} \tag{41}$$
This metric objectively reflects the alignment between clustering results and true classifications by eliminating label permutation uncertainty.
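In practice, this label alignment can be computed with SciPy's implementation of the Hungarian algorithm; the sketch below assumes integer labels in 0..K−1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC via Hungarian matching (Equations (40)-(41)): build the K x K
    contingency matrix and pick the label permutation that maximizes the
    number of correctly matched samples."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1                      # C_ij counts
    row, col = linear_sum_assignment(-contingency)  # maximize sum C[i, sigma(i)]
    return contingency[row, col].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])        # a pure relabeling of y_true
print(clustering_accuracy(y_true, y_pred))   # 1.0
```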
Normalized Mutual Information (NMI): NMI quantifies the shared information between true labels and predicted labels. Let the joint probability distribution between the true label set Y and the predicted label set $\hat{Y}$ be $p(y, \hat{y})$, with marginal distributions $p(y)$ and $p(\hat{y})$. The mutual information is defined as:
$$I(Y; \hat{Y}) = \sum_{y \in Y} \sum_{\hat{y} \in \hat{Y}} p(y, \hat{y}) \log \frac{p(y, \hat{y})}{p(y)\,p(\hat{y})} \tag{42}$$
To avoid the influence of the number of clusters, NMI constrains the value to the [0, 1] interval via normalization:
$$\mathrm{NMI}(Y, \hat{Y}) = \frac{I(Y; \hat{Y})}{\mathrm{mean}\left(H(Y), H(\hat{Y})\right)} \tag{43}$$
Here, $H(\cdot)$ denotes marginal entropy, and $\mathrm{mean}(\cdot)$ represents the average value. The closer NMI is to 1, the more ideal the clustering result.
Adjusted Rand Index (ARI): ARI measures the statistical consistency between clustering results and true labels at a macro level. Its formula is:
$$\mathrm{ARI} = \frac{\mathrm{RI} - E(\mathrm{RI})}{\max(\mathrm{RI}) - E(\mathrm{RI})} \tag{44}$$
where RI is the Rand Index, representing the percentage of correct predictions, defined as:
$$\mathrm{RI} = \frac{TP + TN}{N(N-1)/2} \tag{45}$$
Here, $E(\cdot)$ is the expected value under random partitioning, $\max(\cdot)$ is the theoretical maximum matching value, and TP and TN are the numbers of sample pairs correctly placed in the same cluster and in different clusters, respectively. ARI constrains the result to the interval [−1, 1] through normalization: ARI = 1 indicates perfect matching, ARI = 0 indicates consistency with random partitioning, and negative values indicate worse-than-random performance.

5.1.2. Metrics Based on Internal Structure

Silhouette Coefficient (SC): SC measures the intra-cluster compactness and inter-cluster separation of clustering results. For a sample $i$, SC is calculated as:
$$\mathrm{SC}(i) = \frac{d_{\mathrm{exter}}(i) - d_{\mathrm{intra}}(i)}{\max\{ d_{\mathrm{intra}}(i),\, d_{\mathrm{exter}}(i) \}}$$
where $d_{\mathrm{intra}}(i)$ is the average distance from sample $i$ to all other samples in the same cluster, and $d_{\mathrm{exter}}(i)$ is the average distance from sample $i$ to all samples in the nearest neighboring cluster. The overall SC is the mean of the silhouette coefficients across all samples and ranges over [−1, 1]; a value closer to 1 indicates better intra-cluster compactness and inter-cluster separation.
Davies-Bouldin Index (DBI): For each cluster, this index takes the worst-case (maximum) ratio of the sum of within-cluster average distances to the distance between cluster centers, and averages these ratios over all clusters:
$$\mathrm{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{\bar{d}_i + \bar{d}_j}{d(\mu_i, \mu_j)} \right)$$
where $\bar{d}_i$ is the average distance of all samples in cluster $i$ to its center $\mu_i$, and $d(\mu_i, \mu_j)$ is the distance between cluster centers $\mu_i$ and $\mu_j$. A smaller DBI indicates more compact clusters and better separation between clusters.
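Both internal metrics are computed directly from the learned embeddings and the predicted labels, without ground truth; a minimal scikit-learn sketch (assuming an (N, d) array embeddings) is:

```python
from sklearn.metrics import davies_bouldin_score, silhouette_score

sc = silhouette_score(embeddings, y_pred)       # mean silhouette over all samples, in [-1, 1]
dbi = davies_bouldin_score(embeddings, y_pred)  # smaller values indicate better clustering
```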

5.2. Baseline Methods

This article compares ERL-DTC with the following seven baseline methods:
LCSS + KM: Computes trajectory similarity based on the Longest Common Subsequence (LCSS) between different trajectories and then employs k-means for clustering [40].
DTW + SC: Calculates the optimal alignment path between trajectories by constructing a spatiotemporal alignment distance matrix; spectral clustering is then applied to the resulting similarities for trajectory partitioning [8] (a minimal sketch of this pipeline appears after this list).
Traj2vec + KM: Extracts trajectory movement features via a sliding window and encodes them into fixed-length vector representations using an LSTM encoder. The k-means algorithm is then used to cluster these vectors for motion pattern division [24].
ITraj2vec + KM: Improves Traj2vec + KM by introducing a masking strategy that excludes the padded parts of variable-length sequences from the loss calculation, thereby improving loss accuracy; it additionally replaces the LSTM network with a BiLSTM network to strengthen bidirectional feature modeling.
TrajRCL + KM: Obtains trajectory feature representations through a Transformer encoder, guiding model training with a low-distortion trajectory reconstruction loss and a self-supervised contrastive loss, and then clusters the resulting representations with k-means [14].
DTC: Utilizes a Seq2Seq autoencoder network for pre-training to obtain preliminary trajectory vector representations. During joint training, clustering loss is incorporated to co-optimize the learning of trajectory embedded representation and cluster centers. This represents the first deep trajectory clustering research oriented towards a clustering-friendly space [15].
DSTC: Achieves spatiotemporal trajectory representation learning and end-to-end clustering for a clustering-friendly space through density-based spatiotemporal token representation and periodic time encoding methods, combining Seq2Seq pre-training with clustering loss optimization [28].
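To make the point-matching baselines concrete, the sketch below outlines a generic DTW + spectral clustering pipeline of the kind referenced above; the plain DTW recursion and the Gaussian-kernel bandwidth sigma are illustrative choices, not settings taken from [8]. Note the O(N²) pairwise distance computation, which underlies the runtime gap reported in Section 5.3.1.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_spectral(trajs, K, sigma=1.0):
    """Pairwise DTW distances over all trajectories, then spectral clustering
    on a Gaussian-kernel affinity matrix."""
    N = len(trajs)
    dist = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):              # O(N^2) trajectory pairs
            dist[i, j] = dist[j, i] = dtw(trajs[i], trajs[j])
    affinity = np.exp(-dist / sigma)
    return SpectralClustering(n_clusters=K, affinity="precomputed").fit_predict(affinity)
```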

5.3. Experimental Results

We conducted performance evaluation, ablation studies, and parameter sensitivity analysis to verify the effectiveness of the proposed method. All experiments were implemented in Python 3.9 and PyTorch 1.13 and performed on a platform configured with an 11th Gen Intel Core i7-11800H @ 2.30 GHz CPU, 16 GB RAM, and a GeForce RTX 3060 GPU. All evaluation metrics are reported as the mean and standard deviation of 10 independent runs under identical conditions.

5.3.1. Performance Evaluation

This section presents a comparative performance analysis between ERL-DTC and the baseline methods. For ERL-DTC, the embedding dimension was set to 256, and the sliding window length and stride were set to 10 and 5, respectively. The BiLSTM-AE hidden layer size and number of layers were set to 64 and 2. The DualSFE used 4 attention heads and was stacked for 6 layers. Adam was used as the optimizer with a learning rate of 0.0001. The temperature hyperparameter τ was set to 0.07 [41], the maximum number of epochs was 30, and the batch size was set to 16. The other five deep clustering methods (Traj2vec + KM, ITraj2vec + KM, TrajRCL + KM, DTC, and DSTC) adopted the same parameter configuration as ERL-DTC. The point-matching-based clustering methods (LCSS + KM, DTW + SC) used their default parameter settings.
1. Accuracy
First, the evaluation metrics from Section 5.1 were used to quantitatively assess the clustering accuracy performance of the proposed method. Table 1 shows the comparison of clustering results between the proposed method and the baseline methods on datasets with different numbers of clusters (e.g., K = 4 indicates the dataset contains 4 types of trajectory samples). The values in the table are in the format of “mean ± standard deviation”, with the best results highlighted in bold.
The experimental results in Table 1 show that deep trajectory clustering methods generally outperform point-matching-based methods in accuracy. The primary reason is that the latter are highly sensitive to noise and trajectory length: trajectories with substantial noise or varying lengths cause mismatches and significantly degraded clustering performance. However, as the number of clusters decreases, the performance gap between deep clustering and point-matching methods narrows while all evaluation metrics improve. This is because deep feature embedded representations can be over-parameterized for simpler tasks, whereas traditional clustering methods are easier to optimize. Additionally, existing deep clustering methods may produce insufficiently discriminative embeddings because their feature extraction is not guided toward the most informative trajectory characteristics.
In the four-class task (K = 4), the ACC of ERL-DTC reaches 0.7588 ± 0.013, representing improvements of 32.1% and 28.5% over ITraj2vec + KM and TrajRCL + KM, respectively, and an approximately 14.1% improvement over the current state-of-the-art deep clustering framework, DSTC. NMI and ARI improve by about 28.9% and 30.2%, respectively, demonstrating a significant advantage in clustering accuracy. Simultaneously, ERL-DTC achieves the highest SC (0.401 ± 0.009) and the lowest DBI (1.62 ± 0.05), indicating an embedding structure well suited to the clustering task. Furthermore, the standard deviations of ERL-DTC are generally the lowest in the entire table, showing the best stability and robustness. To visualize the clustering effects of different methods intuitively, this paper employs t-SNE to project the output trajectory embedded representations into two dimensions, as shown in Figure 10.
In Figure 10, t-SNE reduces the high-dimensional embedded representations to a two-dimensional plane, where each point represents a trajectory and its color indicates the predicted category. In the visualizations of the point-matching-based methods, points of different colors are severely intermingled and cluster boundaries are extremely blurry. ITraj2vec + KM and TrajRCL + KM show some improvement but still fail to form distinct aggregation regions, consistent with their lower SC and higher DBI in Table 1 and indicating that their embedded representations lack sufficient discriminative power. The plots for the more advanced DTC and DSTC methods begin to show category aggregation, but cluster centers remain relatively close, with significant boundary overlap and scattered points, reflecting a clustering structure that is not sufficiently compact and clear. The four clusters obtained by ERL-DTC exhibit larger inter-cluster spacing and tighter intra-cluster spacing in the embedding space, forming high-density, clump-like groups with clear boundaries and very few samples in overlapping regions. This indicates that the embedded representation generated by ERL-DTC is highly discriminative, pulling trajectories of the same class together and pushing different classes apart; being well adapted to the clustering task, it enables more accurate trajectory pattern division. This benefit stems from ERL-DTC's ability to extract the most distinctive trajectory features across different spatiotemporal scales and to model multi-dimensional feature coupling relationships, making it better suited to mining patterns from trajectories with wide spatiotemporal distributions, large positional spans, and complex feature relationships.
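The projections in Figure 10 follow the standard t-SNE recipe; a minimal sketch (the perplexity value is an illustrative default, not necessarily the setting used for the figure) is:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, d) trajectory representations; y_pred: predicted cluster labels
emb2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(emb2d[:, 0], emb2d[:, 1], c=y_pred, s=5, cmap="tab10")
plt.show()
```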
It is worth noting that the k-means clustering algorithm requires specifying the number of clusters beforehand. In this study, the “elbow method” [42] was used to determine the number of clusters.
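A minimal sketch of the elbow procedure over the learned embeddings (the candidate range of K is illustrative) is:

```python
from sklearn.cluster import KMeans

inertias = []
for k in range(2, 9):  # candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
# K is chosen at the "elbow" where the inertia curve visibly flattens.
```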
2. Computational Complexity
Next, we analyze the computational efficiency of different methods. Since deep trajectory clustering models can be trained offline, and in practical applications the time spent encoding trajectories into embedded representations and performing the clustering task is of greater concern, this section focuses solely on the online computational complexity. The online cost of ERL-DTC stems primarily from the encoding processes of the TA-MAN and DualSFE modules. Let the trajectory length be L, the embedding dimension be d, and the length of the feature sequence after TA-MAN aggregation be s (s ≪ L). TA-MAN consists of a BiLSTM encoder and a two-level temporal attention aggregation: the BiLSTM encodes a single trajectory in O(L·d²) time, and the local and global attention aggregation costs O(s·d²). Since s ≪ L, the combined encoding complexity of TA-MAN is O(L·d²). DualSFE, based on a multi-head self-attention mechanism, processes the feature sequence aggregated by TA-MAN, so its attention cost is reduced to O(s²·d). The overall time complexity of ERL-DTC is therefore approximately O(L·d² + s²·d). Compared with the quadratic O(L²) cost of traditional point-matching methods, ERL-DTC achieves an efficiency leap from quadratic to approximately linear order in the trajectory length by shortening the computed sequence from L to s through multi-scale aggregation. Furthermore, the model's matrix operations are highly parallelizable and well suited to GPU acceleration.
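The source of this gain, shortening the sequence from L to s before the quadratic attention stage, can be illustrated with a minimal window-level attention pooling module. This is a simplified sketch under an assumed fixed window length w, not the full TA-MAN design:

```python
import torch
import torch.nn as nn

class WindowAttnPool(nn.Module):
    """Compress each window of w steps into one vector, shortening a (B, L, d)
    sequence to (B, s, d) with s = L // w (a sketch, not the exact TA-MAN)."""
    def __init__(self, d, w):
        super().__init__()
        self.w = w
        self.score = nn.Linear(d, 1)  # scalar attention score per time step

    def forward(self, x):  # x: (B, L, d); L is assumed divisible by w
        B, L, d = x.shape
        x = x.view(B, L // self.w, self.w, d)
        a = torch.softmax(self.score(x), dim=2)  # attention weights within each window
        return (a * x).sum(dim=2)                # (B, s, d)

# Any subsequent O(s^2 * d) self-attention now operates on the shortened sequence.
```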
The online computation times for different methods are shown in Table 2. For point-matching clustering methods, the time primarily includes point matching, feature calculation, and the clustering process. For deep clustering methods, the time mainly covers feature embedding and the clustering process. The values in the table are averages from 10 repeated calculations on datasets with a specified number of clusters under the same conditions.
As the table shows, deep trajectory clustering methods improve computational efficiency over traditional point-matching-based algorithms by roughly two orders of magnitude, consistent with the theoretical analysis. Moreover, despite its more complex architecture, the inference time of ERL-DTC is on par with, or better than, that of other deep clustering methods such as DSTC. This is primarily because TA-MAN shortens the sequence through aggregation, so the subsequent, more computationally intensive attention operations in DualSFE run on shorter sequences, avoiding a sharp increase in computational cost.

5.3.2. Ablation Studies

To verify the effectiveness of each component in ERL-DTC, we further conducted ablation studies on both model design and training method design.
1. Ablation Analysis of the Trajectory Feature Embedding Model
This section compares ERL-DTC with the following variants to explore the effectiveness and necessity of each module:
Without Multi-Layer Feature Aggregation (w/o MLF): This variant removed the temporal attention-based hierarchical aggregation (including local and global attention) process. The micro-level trajectory embedded representations encoded by BiLSTM-AE were directly fed into DualSFE for subsequent processing.
Without Motion Features (w/o MF): This variant used ordinary multi-head self-attention modules from Transformer instead of DualSFE and only encoded trajectory positional feature information.
Without Positional Features (w/o PF): This variant used ordinary multi-head self-attention modules to encode only trajectory motion feature information.
We conducted ablation experiments on datasets with different cluster numbers (K = 4, K = 3, K = 2) and evaluated clustering effectiveness using five evaluation metrics. The results are shown in Figure 11.
As the figure shows, the full model (Ours) achieves the best and most stable performance across task complexities (K = 4, 3, 2). In the four-cluster task (K = 4), its three external metrics are significantly superior to those of all variant models, with the advantage gradually diminishing as task complexity decreases. Relative to the full model, the variant without multi-layer feature aggregation (w/o MLF) exhibits the most pronounced degradation (ACC decreases by approximately 13.0% and SC by about 21.2% in the four-cluster task), demonstrating the crucial role of TA-MAN in capturing temporal patterns at different hierarchical levels of trajectories. The single-feature variants with ordinary multi-head self-attention (w/o MF and w/o PF) also lose performance (ACC decreases by about 8.9% and 9.9%, and SC by approximately 8.0% and 12.2%, respectively, in the four-cluster task), verifying the rationality and necessity of dual-feature fusion encoding. Furthermore, removing positional features (w/o PF) causes a slightly larger decline than removing motion features (w/o MF), suggesting that positional information contributes slightly more to characterizing vessel activity patterns, although the two are clearly complementary. These results validate the designs of TA-MAN and DualSFE: the full model learns trajectory feature embeddings with stronger discriminative power and better structure, significantly enhancing clustering performance, especially in complex multi-category, cross-scale trajectory analysis tasks.
2. Ablation Analysis of the Loss Function
This section analyzes the impact of different loss function combinations on the trajectory clustering results, as shown in Table 3. A "√" indicates that the corresponding loss term is included in the objective, and an "×" that it is not. Combining the contrastive loss with the clustering losses (the k-means loss Lkm, the soft assignment loss Lsoft, the inter-cluster distance loss Linter, and the neighborhood consistency loss Lnb) yields markedly better clustering results than using Lcont alone. In the four-class task, ACC improves by approximately 4.7%, and the gains in the internal metrics are even larger: SC increases by 68.5% and DBI decreases from 2.12 to 1.62, an improvement of about 23.6%. This indicates that the joint loss function not only enhances the alignment between clustering results and true labels but also optimizes the intrinsic structure of the feature embedding space, yielding superior intra-cluster compactness and inter-cluster separation.
As the clustering loss terms are progressively added, clustering performance improves steadily, demonstrating that the four designed clustering losses are functionally complementary and do not conflict significantly in their optimization directions, which confirms the soundness of the clustering loss design. Combining all four clustering losses with the contrastive loss produced the best experimental results on the dataset, indicating that this design enables the model to learn feature representations that are both discriminative and well structured. Furthermore, the full model exhibits the smallest standard deviation on most metrics, suggesting that the multi-objective joint training strategy also enhances the model's stability and robustness.
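To illustrate how the contrastive and clustering objectives can be combined in one optimization step, the sketch below pairs an InfoNCE-style contrastive loss (with the temperature τ of Section 5.3.1) with a k-means-style pull toward cluster centers. The exact formulations of Lcont, Lkm, Lsoft, Linter, and Lnb in ERL-DTC may differ; the weight lam stands in for the balance coefficient λ.

```python
import torch
import torch.nn.functional as F

def info_nce(e, e_aug, tau=0.07):
    """InfoNCE-style contrastive loss between embeddings and their augmented views
    (a generic sketch; the paper's exact Lcont may differ)."""
    z = F.normalize(torch.cat([e, e_aug]), dim=1)  # (2B, d), unit-norm rows
    sim = z @ z.t() / tau                          # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-pairs
    B = e.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)           # positive = the other view

def kmeans_loss(e, centers, assign):
    """k-means-style clustering loss: pull each embedding toward its assigned center."""
    return ((e - centers[assign]) ** 2).sum(dim=1).mean()

# Joint objective for one batch (lam plays the role of the balance coefficient λ):
# loss = info_nce(e, e_aug) + lam * kmeans_loss(e, centers, assign)
```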

5.3.3. Parameter Sensitivity Analysis

This section conducts a parameter sensitivity analysis on four key hyperparameters of the ERL-DTC model: embedding dimension size d, number of DualSFE layers Num_layer, number of attention heads Num_h, and batch size Num_batch. Since the trends of metric changes were similar, we conducted experiments only on the four-cluster and three-cluster tasks for brevity. The results are shown in Figure 12, Figure 13, Figure 14 and Figure 15.
(1) Influence of the Embedding Dimension Size d
As shown in Figure 12, the model’s representational capacity first improves and then saturates as the embedding dimension increases. In the four-class task (K = 4), when d increases from 32 to 256, ACC improves by approximately 9.1%, and internal structural metrics improve simultaneously. This demonstrates that a larger embedding dimension provides a richer feature representation space, enabling better capture of complex trajectory patterns. However, when d is increased further, metrics like NMI show slight fluctuations, indicating that excessively large dimensions may lead to a degree of overfitting and training instability. Therefore, selecting d = 256 achieves an optimal balance between adequately encoding information and maintaining generalization capability.
(2) Influence of the Encoder Layer Number Num_layer
The depth of the encoder typically affects the model’s ability to handle complex feature interactions. As shown in Figure 13, in the four-class task, increasing the number of layers from 2 to 6 improves ACC by about 11.3% and SC by about 3.1%. This confirms that a deeper self-attention encoder structure can more effectively model the higher-order, non-linear coupling relationships between trajectory position and motion features. When the number of layers is increased further, metric improvements stagnate, suggesting that very deep networks may suffer from optimization difficulties due to excessive parameters. Concurrently, the 6-layer model demonstrated the most stable training process. Thus, Num_layer = 6 was determined to be the optimal configuration.
(3) Influence of the Number of Attention Heads Num_h
As can be seen from Figure 14, the number of attention heads has a noticeable impact on the model’s feature extraction capability. As the number of heads increases, most metrics gradually improve and then tend to stabilize. Considering the trade-off between performance gains and computational cost, this paper finds that 4 heads are sufficient to capture the core interaction patterns of the multi-dimensional trajectory features, with diminishing marginal benefits from adding more heads. Therefore, selecting Num_h = 4 achieves the best balance between performance and efficiency.
(4) Influence of the Batch Size Num_batch
As shown in Figure 15, as the batch size is gradually increased, the metrics show steady improvement, indicating that larger batches provide more stable gradient estimates. However, larger is not always better, which differs from some conclusions in prior contrastive learning research [43]. This is because, within the joint training framework, clustering loss is considered alongside contrastive loss. The calculation of clustering loss heavily relies on the distribution of samples in the current batch. When the batch size is too large, a single batch may contain samples from multiple clusters, diluting and blurring the cluster centers, causing the centers computed based on the current batch to inaccurately reflect the global cluster structure. Furthermore, an excessively large batch size strengthens the contrastive loss’s effect of pushing all negative sample pairs apart, which may interfere with, or even overwhelm, the more fine-grained clustering optimization direction based on the current batch. This study needed to strike a balance between these two objectives, and a moderate batch size (Num_batch = 16) was found to be the optimal compromise.

6. Conclusions and Future Work

Aiming at the problems of difficult cross-spatiotemporal-scale feature extraction for ship trajectories and the complex coupling between positional and motion features, which lead to suboptimal clustering performance, this paper proposes a deep trajectory clustering method named ERL-DTC. The method achieves multi-scale temporal feature aggregation through TA-MAN, accomplishes adaptive fusion of positional and motion features using DualSFE, and jointly optimizes the embedding space via contrastive learning and multi-objective clustering losses, systematically improving the discriminative power and clustering-friendliness of trajectory embedded representations. Experiments validate the superiority of ERL-DTC. In the four-class task, its ACC reaches 0.7588, a 28.5% improvement over TrajRCL + KM and a 14.1% improvement over the current state-of-the-art DSTC. ERL-DTC also achieves the highest Silhouette Coefficient (SC = 0.401) and the lowest Davies-Bouldin Index (DBI = 1.62), showing that its clustering results possess the best intra-cluster compactness and inter-cluster separation. This indicates that the proposed method more effectively captures cross-scale temporal patterns and multi-dimensional feature interactions in ship trajectories, forming more discriminative and stable clustering-friendly representations. Furthermore, the online inference time of ERL-DTC improves on traditional point-matching methods by roughly two orders of magnitude and is comparable or even superior to that of other deep clustering methods. This research can provide effective technical support for non-cooperative ship behavior analysis, route pattern mining, and identity inference in maritime surveillance, addressing the shortcomings of existing methods in analyzing trajectories with cross-regional, long-term, and multi-feature coupling characteristics.
However, this study still has certain limitations: the current method relies on unsupervised learning, leaving room for performance improvement in scenarios with fuzzy class boundaries or sample imbalance; the model structure is relatively complex, resulting in longer training times. Future work could consider introducing semi-supervised learning mechanisms, utilizing a small amount of labeled data to guide model optimization. Simultaneously, exploring lightweight network designs could further enhance the method’s practicality and scalability.

Author Contributions

Conceptualization, Y.L. and Z.S.; methodology, Y.L. and J.K.; software, B.F., J.K., X.W. and Y.L.; validation, Y.L., Z.S., X.W. and H.X.; formal analysis, B.F. and X.W.; investigation, X.W.; resources, Z.S.; data curation, B.F.; writing—original draft preparation, Y.L.; writing—review and editing, Z.S. and H.X.; visualization, Y.L.; supervision, Z.S.; project administration, B.F.; funding acquisition, B.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available on request due to restrictions of privacy.

Acknowledgments

The first author wishes to express his gratitude to his supervisor for his guidance and to the co-authors for their assistance in writing this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviations:
ERL-DTC: Deep ship trajectory clustering method based on feature embedded representation learning
TA-MAN: Temporal Attention-based Multi-scale feature Aggregation Network
DualSFE: Dual-feature Self-attention Fusion Encoder
BiLSTM: Bidirectional Long Short-Term Memory network
DFSAM: Dual-Feature Self-Attention Module
FFN: Feedforward Neural Network
DTW: Dynamic Time Warping
LCSS: Longest Common Subsequence
AIS: Automatic Identification System
SOG: Speed Over Ground
COG: Course Over Ground

Symbols:
T_i, T_i^+: the i-th ship trajectory and its augmented sample
p_i: the i-th trajectory point
L: trajectory length (number of trajectory points)
N: total number of trajectories
K: number of clusters
d: embedding dimension size
e_i, e_i^+: feature embedded representations of trajectory T_i and of its augmented sample T_i^+
μ_k: the k-th cluster center
Sim(T_i, T_j): similarity between trajectories T_i and T_j
PE(i): temporal positional encoding of trajectory point p_i
Q, K, V: query, key, and value matrices in the attention mechanism
γ: dual-feature fusion weight parameter
τ: temperature parameter in the contrastive loss
λ: balance coefficient in the clustering loss

References

1. Ljunggren, H. Using Deep Learning for Classifying Ship Trajectories. In Proceedings of the 2018 21st International Conference on Information Fusion (FUSION), Cambridge, UK, 10–13 July 2018; pp. 2158–2164.
2. Szarmach, M.; Czarnowski, I. A Framework for Damage Detection in AIS Data Based on Clustering and Multi-Label Classification. J. Comput. Sci. 2024, 76, 102218.
3. Rong, Y.; Zhuang, Z.; He, Z.; Wang, X. A Maritime Traffic Network Mining Method Based on Massive Trajectory Data. Electronics 2022, 11, 987.
4. Li, Y.; Liu, Z.; Zheng, Z. Study on Complexity of Marine Traffic Based on Traffic Intrinsic Features and Data Mining. J. Comput. Methods Sci. Eng. 2019, 19, 1–15.
5. Guo, Z.; Qiang, H.; Xie, S.; Peng, X. Unsupervised Knowledge Discovery Framework: From AIS Data Processing to Maritime Traffic Networks Generating. Appl. Ocean Res. 2024, 146, 103924.
6. Wang, S.; Zhang, Y.; Zheng, Y. Multi-Ship Encounter Situation Adaptive Understanding by Individual Navigation Intention Inference. Ocean Eng. 2021, 237, 109612.
7. Yang, Y.; Liu, Y.; Li, G.; Zhang, Z.; Liu, Y. Harnessing the Power of Machine Learning for AIS Data-Driven Maritime Research: A Comprehensive Review. Transp. Res. Part E Logist. Transp. Rev. 2024, 183, 103426.
8. Li, H.; Lam, J.S.L.; Yang, Z.; Liu, J.; Liu, R.W.; Liang, M.; Li, Y. Unsupervised Hierarchical Methodology of Maritime Traffic Pattern Extraction for Knowledge Discovery. Transp. Res. Part C Emerg. Technol. 2022, 143, 103856.
9. Cai, Y.; Zhang, Z.; Cai, Z.; Liu, X.; Jiang, X. Hypergraph-Structured Autoencoder for Unsupervised and Semisupervised Classification of Hyperspectral Image. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
10. Yao, D.; Hu, H.; Du, L.; Cong, G.; Han, S.; Bi, J. TrajGAT: A Graph-Based Long-Term Dependency Modeling Approach for Trajectory Similarity Computation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2275–2285.
11. Guo, N.; Ma, M.; Xiong, W.; Chen, L.; Jing, N. An Efficient Query Algorithm for Trajectory Similarity Based on Fréchet Distance Threshold. ISPRS Int. J. Geo-Inf. 2017, 6, 326.
12. Cao, H.; Tang, H.; Wu, Y.; Wang, F.; Xu, Y. On Accurate Computation of Trajectory Similarity via Single Image Super-Resolution. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–9.
13. Yao, D.; Zhang, C.; Zhu, Z.; Huang, J.; Bi, J. Trajectory Clustering via Deep Representation Learning. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 3880–3887.
14. Li, S.; Chen, W.; Yan, B.; Li, Z.; Zhu, S.; Yu, Y. Self-Supervised Contrastive Representation Learning for Large-Scale Trajectories. Future Gener. Comput. Syst. 2023, 148, 357–366.
15. Wang, C.; Lyu, F.; Wu, S.; Wang, Y.; Xu, L.; Zhang, F.; Wang, S.; Wang, Y.; Du, Z. A Deep Trajectory Clustering Method Based on Sequence-to-Sequence Autoencoder Model. Trans. GIS 2022, 26, 1801–1820.
16. Tedjopurnomo, D.A.; Li, X.; Bao, Z.; Cong, G.; Choudhury, F.; Qin, A.K. Similar Trajectory Search with Spatio-Temporal Deep Representation Learning. ACM Trans. Intell. Syst. Technol. 2021, 12, 1–26.
17. Xie, Z.; Bai, X.; Xu, X.; Xiao, Y. An Anomaly Detection Method Based on Ship Behavior Trajectory. Ocean Eng. 2024, 293, 116640.
18. Zhao, L.; Shi, G. A Trajectory Clustering Method Based on Douglas-Peucker Compression and Density for Marine Traffic Pattern Recognition. Ocean Eng. 2019, 172, 456–467.
19. Zhen, R.; Jin, Y.; Hu, Q.; Shao, Z.; Nikitakos, N. Maritime Anomaly Detection within Coastal Waters Based on Vessel Trajectory Clustering and Naïve Bayes Classifier. J. Navig. 2017, 70, 648–670.
20. Zhou, Y.; Daamen, W.; Vellinga, T.; Hoogendoorn, S.P. Ship Classification Based on Ship Behavior Clustering from AIS Data. Ocean Eng. 2019, 175, 176–187.
21. Chang, Y.; Qi, J.; Liang, Y.; Tanin, E. Contrastive Trajectory Similarity Learning with Dual-Feature Attention. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 2933–2945.
22. Chen, Y.; Yu, P.; Chen, W.; Zheng, Z.; Guo, M. Embedding-Based Similarity Computation for Massive Vehicle Trajectory Data. IEEE Internet Things J. 2022, 9, 4650–4660.
23. Li, X.; Zhao, K.; Cong, G.; Jensen, C.S.; Wei, W. Deep Representation Learning for Trajectory Similarity Computation. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 16–19 April 2018; pp. 617–628.
24. Yao, D.; Zhang, C.; Zhu, Z.; Hu, Q.; Wang, Z.; Huang, J.; Bi, J. Learning Deep Representation for Trajectory Clustering. Expert Syst. 2018, 35, e12252.
25. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
26. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. arXiv 2013, arXiv:1310.4546.
27. Chen, Z.; Li, K.; Zhou, S.; Chen, L.; Shang, S. Towards Robust Trajectory Similarity Computation: Representation-Based Spatio-Temporal Similarity Quantification. World Wide Web 2023, 26, 1271–1294.
28. Wang, C.; Huang, J.; Wang, Y.; Lin, Z.; Jin, X.; Jin, X.; Weng, D.; Wu, Y. A Deep Spatiotemporal Trajectory Representation Learning Framework for Clustering. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7687–7700.
29. Fang, Z.; Du, Y.; Chen, L.; Hu, Y.; Gao, Y.; Chen, G. E2DTC: An End to End Deep Trajectory Clustering Framework via Self-Training. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 696–707.
30. Yao, D.; Cong, G.; Zhang, C.; Bi, J. Computing Trajectory Similarity in Linear Time: A Generic Seed-Guided Neural Metric Learning Approach. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1358–1369.
31. Zhang, T.; Zhao, S.; Cheng, B.; Chen, J. Detection of AIS Closing Behavior and MMSI Spoofing Behavior of Ships Based on Spatiotemporal Data. Remote Sens. 2020, 12, 702.
32. Kong, W.; Cui, Y.; Peng, X.; Xiong, W.; Sun, W.; Gu, X.; Wang, Z.; Xia, S.; Dong, K.; Yu, H. Sea and Air Target Dataset Based on Self-Reporting Position Trajectory Data. Signal Process. 2024, 40, 2085–2094.
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
34. Wang, B.; Shang, L.; Lioma, C.; Jiang, X.; Yang, H.; Liu, Q.; Simonsen, J.G. On Position Embeddings in BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 3–7 May 2021.
35. Capobianco, S.; Millefiori, L.M.; Forti, N.; Braca, P.; Willett, P. Deep Learning Methods for Vessel Trajectory Prediction Based on Recurrent Neural Networks. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 4329–4346.
36. Gers, F. Learning to Forget: Continual Prediction with LSTM; IET: Edinburgh, Scotland, 1999; p. 855.
37. Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748.
38. Aljalbout, E.; Golkov, V.; Siddiqui, Y.; Strobel, M.; Cremers, D. Clustering with Deep Learning: Taxonomy and New Methods. arXiv 2018, arXiv:1801.07648.
39. Kuhn, H. The Hungarian Method for the Assignment Problem. Nav. Res. Logist. Q. 2012, 2, 83–97.
40. Vlachos, M.; Kollios, G.; Gunopulos, D. Discovering Similar Multidimensional Trajectories. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, 26 February–1 March 2002; pp. 673–684.
41. Deng, L.; Zhao, Y.; Fu, Z.; Sun, H.; Liu, S.; Zheng, K. Efficient Trajectory Similarity Computation with Contrastive Learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–22 October 2022; pp. 365–374.
42. Tibshirani, R.; Walther, G.; Hastie, T. Estimating the Number of Clusters in a Data Set via the Gap Statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 2001, 63, 411–423.
43. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709.
Figure 1. Trajectory clustering process.
Figure 2. Ship trajectory dataset.
Figure 3. Statistical distribution of the SOG of the trajectory dataset. (a) Category A; (b) Category B; (c) Category C; (d) Category D.
Figure 4. Statistical distribution of the COG of the trajectory dataset. (a) Category A; (b) Category B; (c) Category C; (d) Category D.
Figure 5. Framework overview of the ERL-DTC method.
Figure 6. Schematic diagram of BiLSTM.
Figure 7. Temporal Attention-based Multi-scale feature Aggregation Network.
Figure 8. Dual-Feature Self-Attention Fusion Encoder.
Figure 9. Schematic diagram of the model training process.
Figure 10. Visualization of clustering results of different algorithms on datasets (K = 4). (a) LCSS + KM; (b) DTW + SC; (c) Traj2vec + KM; (d) ITraj2vec + KM; (e) TrajRCL + KM; (f) DTC; (g) DSTC; (h) ERL-DTC.
Figure 11. Experimental results of ablation on different datasets. (a) ACC; (b) NMI; (c) ARI; (d) SC; (e) DBI.
Figure 12. Sensitivity analysis of embedding dimension size. (a) ACC; (b) NMI; (c) ARI; (d) SC; (e) DBI.
Figure 13. Sensitivity analysis of encoder layers. (a) ACC; (b) NMI; (c) ARI; (d) SC; (e) DBI.
Figure 14. Sensitivity analysis of the number of attention heads. (a) ACC; (b) NMI; (c) ARI; (d) SC; (e) DBI.
Figure 15. Sensitivity analysis of batch size. (a) ACC; (b) NMI; (c) ARI; (d) SC; (e) DBI.
Table 1. Comparison of results of different trajectory clustering methods (mean ± standard deviation).

K = 4
Method           ACC              NMI              ARI              SC              DBI
LCSS + KM        0.4539 ± 0.021   0.2570 ± 0.018   0.1673 ± 0.015   0.102 ± 0.012   2.85 ± 0.11
DTW + SC         0.2735 ± 0.025   0.0536 ± 0.015   0.0007 ± 0.019   0.058 ± 0.018   3.42 ± 0.15
Traj2vec + KM    0.5451 ± 0.018   0.2771 ± 0.012   0.2638 ± 0.016   0.201 ± 0.011   2.31 ± 0.10
ITraj2vec + KM   0.5745 ± 0.017   0.3068 ± 0.011   0.2992 ± 0.015   0.218 ± 0.010   2.22 ± 0.09
TrajRCL + KM     0.5907 ± 0.019   0.3272 ± 0.017   0.2936 ± 0.016   0.235 ± 0.015   2.17 ± 0.11
DTC              0.6553 ± 0.015   0.4156 ± 0.014   0.4209 ± 0.018   0.315 ± 0.013   1.98 ± 0.07
DSTC             0.6648 ± 0.014   0.4148 ± 0.013   0.4160 ± 0.017   0.322 ± 0.012   1.95 ± 0.07
ERL-DTC          0.7588 ± 0.013   0.5345 ± 0.013   0.5416 ± 0.015   0.401 ± 0.009   1.62 ± 0.05

K = 3
Method           ACC              NMI              ARI              SC              DBI
LCSS + KM        0.6828 ± 0.019   0.4045 ± 0.015   0.3561 ± 0.020   0.185 ± 0.010   2.21 ± 0.09
DTW + SC         0.6355 ± 0.022   0.2471 ± 0.013   0.1651 ± 0.018   0.121 ± 0.009   2.65 ± 0.12
Traj2vec + KM    0.5822 ± 0.015   0.3075 ± 0.011   0.2878 ± 0.014   0.223 ± 0.009   2.10 ± 0.08
ITraj2vec + KM   0.5964 ± 0.016   0.2799 ± 0.012   0.2952 ± 0.015   0.231 ± 0.010   2.05 ± 0.09
TrajRCL + KM     0.6211 ± 0.018   0.3207 ± 0.018   0.3429 ± 0.015   0.237 ± 0.013   1.98 ± 0.11
DTC              0.7266 ± 0.013   0.4183 ± 0.013   0.4268 ± 0.016   0.378 ± 0.012   1.76 ± 0.06
DSTC             0.7255 ± 0.014   0.3826 ± 0.014   0.4101 ± 0.017   0.385 ± 0.011   1.72 ± 0.06
ERL-DTC          0.8462 ± 0.015   0.6073 ± 0.013   0.6139 ± 0.015   0.458 ± 0.011   1.41 ± 0.06

K = 2
Method           ACC              NMI              ARI              SC              DBI
LCSS + KM        0.9171 ± 0.012   0.6115 ± 0.022   0.6948 ± 0.018   0.301 ± 0.015   1.55 ± 0.08
DTW + SC         0.6948 ± 0.020   0.3522 ± 0.019   0.3084 ± 0.022   0.158 ± 0.011   2.08 ± 0.10
Traj2vec + KM    0.8044 ± 0.014   0.3672 ± 0.016   0.3539 ± 0.017   0.342 ± 0.013   1.72 ± 0.07
ITraj2vec + KM   0.8122 ± 0.013   0.3628 ± 0.015   0.3681 ± 0.016   0.351 ± 0.012   1.68 ± 0.07
TrajRCL + KM     0.8661 ± 0.018   0.4574 ± 0.018   0.5935 ± 0.024   0.364 ± 0.016   1.57 ± 0.06
DTC              0.8951 ± 0.014   0.5916 ± 0.018   0.6893 ± 0.015   0.412 ± 0.014   1.42 ± 0.05
DSTC             0.9218 ± 0.019   0.5925 ± 0.017   0.6954 ± 0.014   0.425 ± 0.013   1.38 ± 0.05
ERL-DTC          0.9202 ± 0.011   0.6862 ± 0.016   0.7008 ± 0.014   0.487 ± 0.014   1.21 ± 0.04
Table 2. Time used for trajectory feature encoding and clustering (seconds).

Method                 K = 4      K = 3      K = 2
Point-matching-based methods:
  LCSS + KM            2399.91    1502.29    975.53
  DTW + SC             4344.57    2537.77    1204.26
Deep embedded representation-based methods:
  Traj2vec + KM        43.87      36.95      27.95
  ITraj2vec + KM       64.02      51.49      40.75
  TrajRCL + KM         80.19      65.38      47.53
  DTC                  20.86      18.12      14.61
  DSTC                 32.17      25.36      19.77
  ERL-DTC              31.83      23.82      15.09
Table 3. Ablation study of ERL-DTC with different kinds of clustering loss. A "√" indicates that the loss term is included; an "×" indicates that it is not.

K = 4
Lcont  Lkm  Lsoft  Linter  Lnb    ACC              NMI              ARI              SC              DBI
√      ×    ×      ×       ×      0.7246 ± 0.019   0.5281 ± 0.015   0.5161 ± 0.021   0.238 ± 0.022   2.12 ± 0.13
√      √    ×      ×       ×      0.7274 ± 0.015   0.5287 ± 0.017   0.5282 ± 0.018   0.304 ± 0.015   1.97 ± 0.09
√      √    √      ×       ×      0.7388 ± 0.013   0.5324 ± 0.015   0.5302 ± 0.015   0.328 ± 0.012   1.91 ± 0.08
√      √    √      √       ×      0.7483 ± 0.012   0.5332 ± 0.013   0.5333 ± 0.014   0.393 ± 0.009   1.74 ± 0.06
√      √    √      √       √      0.7588 ± 0.013   0.5345 ± 0.013   0.5416 ± 0.015   0.401 ± 0.009   1.62 ± 0.05

K = 3
Lcont  Lkm  Lsoft  Linter  Lnb    ACC              NMI              ARI              SC              DBI
√      ×    ×      ×       ×      0.7834 ± 0.024   0.5962 ± 0.018   0.5898 ± 0.020   0.237 ± 0.013   1.98 ± 0.11
√      √    ×      ×       ×      0.8072 ± 0.019   0.5983 ± 0.015   0.5924 ± 0.015   0.302 ± 0.011   1.78 ± 0.09
√      √    √      ×       ×      0.8237 ± 0.016   0.5999 ± 0.013   0.5993 ± 0.013   0.375 ± 0.013   1.62 ± 0.08
√      √    √      √       ×      0.8426 ± 0.015   0.6014 ± 0.014   0.6017 ± 0.013   0.429 ± 0.010   1.46 ± 0.06
√      √    √      √       √      0.8462 ± 0.015   0.6073 ± 0.013   0.6139 ± 0.015   0.458 ± 0.011   1.41 ± 0.06