Review

Predictive Modeling of Maritime Radar Data Using Transformers: A Survey and Research Agenda

1 Cosys-Lab, Faculty of Applied Engineering, University of Antwerp, 2020 Antwerpen, Belgium
2 Flanders Make Strategic Research Centre, 3920 Lommel, Belgium
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(3), 319; https://doi.org/10.3390/jmse14030319
Submission received: 18 December 2025 / Revised: 30 January 2026 / Accepted: 4 February 2026 / Published: 6 February 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar’s all-weather reliability for navigation. This survey reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating concrete research directions for future work in this area.

1. Introduction

The shipping industry has historically been conservative in adopting technological innovations. However, recent technical advances have increased the pressure to develop more capable perception systems, as autonomous and semi-autonomous vessels transition from research concepts to real operations. Recent surveys by Qiao et al. [1] and Thombre et al. [2] show that artificial intelligence, especially deep learning, has become central to maritime situational awareness, excelling in tasks such as object detection, collision risk assessment, and trajectory prediction. This shift toward autonomous vessels is driven by both safety and economic factors: human error contributed to roughly 58 percent of maritime casualties in European waterways alone over the last decade [3]. At the same time, maritime transport remains the core of global commerce, carrying over 80 percent of world trade by volume, so even modest improvements in safety and efficiency can have a substantial economic impact [4]. Autonomous navigation systems are therefore expected to support more resilient and reliable shipping in this increasingly complex maritime environment, which requires perception systems that do more than merely describe the current scene: they must anticipate how the maritime environment will evolve over time.
Maritime radar data is characterized by a set of properties that make it both attractive and challenging for predictive modeling. Compared with optical cameras and LiDAR, radar remains reliable in fog, rain, and rough seas, which makes it a key all-weather sensing modality for navigation and collision avoidance in challenging conditions [5]. In particular, X-band marine radars can cover several kilometers with typical update periods of about 1 to 2 s and spatial resolutions on the order of 5 to 10 m, providing dense temporal sampling over a wide area and therefore supporting long-horizon forecasting [6]. Nevertheless, marine radar images are sparse and heavily contaminated by sea clutter, leading to missed object detections as well as false alarms. Furthermore, the peculiar reflection regime of radar produces a signal structure that differs significantly from the natural images on which many deep learning models are originally trained [7]. These peculiarities are further shaped by the complexity of the maritime scene itself, where vessels continuously interact with other dynamic actors and with static coastlines and ports. Figure 1 illustrates a maritime scene containing multiple vessel types, navigation aids, and buoys at varying ranges that perception systems must simultaneously monitor and predict.
Current predictive methods for maritime radar data struggle to model long-range temporal dependencies in sparse, clutter-contaminated data, where sea clutter and signal artifacts significantly degrade detection performance [7,8,9]. For example, recurrent neural networks, despite their widespread use in maritime trajectory prediction [10], face vanishing gradients over extended sequences, limiting their effectiveness for long-horizon forecasting. In addition, while convolutional approaches are successful for spatial feature extraction and target detection [9,11], their fixed receptive fields cannot effectively capture the non-local spatiotemporal patterns inherent in vessel motion and environmental dynamics across large temporal spans. Conversely, transformer architectures, with their self-attention mechanisms, offer a promising alternative by enabling parallel processing of long sequences and direct modeling of dependencies across arbitrary time spans without the sequential bottleneck of recurrent networks [12]. Despite their advantages, transformer models are computationally demanding, as attention cost and memory grow rapidly with sequence and token length, and they typically require substantial training data and careful optimization to achieve robust generalization.
Recent surveys have examined maritime prediction from complementary perspectives. Concretely, Li et al. [10] analyze 64 unique trajectory prediction approaches, 28 of which apply deep learning methods. In contrast, Xie et al. [13] review 16 transformer-based models used for mixed short-term and long-term trajectory prediction, where the data come mostly from the Automatic Identification System (AIS), a maritime tracking and anti-collision tool that broadcasts vessel details such as identification, position, and course. Geng et al. [14] survey machine learning in radar signal processing, focusing on classification and clutter suppression. However, as illustrated in Figure 1, maritime scenes contain far more information than discrete vessel coordinates, yet no existing survey addresses raw frame-level prediction for maritime radar. This represents a critical gap, since autonomous systems require the capability to anticipate the evolution of the full scene, not just vessel trajectories.
This survey addresses the identified gap by reviewing the evolution of predictive modeling in maritime applications from traditional methods to deep learning, with particular emphasis on transformer-based approaches. The review examined the literature published between 2010 and 2025, focusing on maritime prediction methods, transformer architectures for spatiotemporal forecasting, and radar signal processing for navigation. The survey then explains why radar introduces distinct challenges relative to AIS coordinates and sonar imagery (i.e., an adjacent domain where transformers have achieved strong performance [15]), including computational scaling, polar geometry, and sea-clutter-dominated dynamics. Given the breadth of the field and the heterogeneity of radar modalities, the survey does not aim to provide exhaustive coverage of all predictive modeling methods; instead, it synthesizes representative work to highlight a key methodological gap and to motivate a focused research agenda on adapting transformer architectures from video prediction and sonar forecasting to maritime radar frame prediction.

2. Background

2.1. Maritime Radar Fundamentals

Maritime radar systems create two-dimensional images by emitting electromagnetic pulses and measuring the time delay, intensity, and frequency shift of reflected signals from objects in the environment. Through analysis of the received signals, radar systems can infer the bearing (antenna position during scan rotation), the range (time delay), and the radial velocity component (Doppler shift) of the reflecting objects. Therefore, understanding the fundamental signal processing techniques of radars is essential for estimating the capabilities and challenges of radar frame prediction.
Range estimation rests on measuring the round-trip time τ between pulse transmission and echo reception, which determines the range R to reflecting objects according to R = cτ/2, where c is the speed of light and the division by two accounts for the two-way propagation path. For a simple unmodulated pulse of duration T, the range resolution ΔR = cT/2 represents the minimum separation between two targets that can be distinguished. However, short pulses contain limited energy, which restricts the maximum detection range and requires high peak transmission power. To overcome this energy-resolution trade-off, modern radars employ pulse compression techniques that achieve precise detection over long distances by modulating the pulse frequency and applying matched filtering at reception [8]. The achievable separation (i.e., the Rayleigh limit) then becomes a function of the bandwidth of the received signal [16,17].
When a radar and a target are not at rest with respect to each other, the frequency F_r of the received echo will differ from the transmitted frequency F_t due to the Doppler effect. Doppler estimation in radar systems is the process of measuring this frequency shift to detect whether the target is approaching the radar (i.e., a higher received frequency) or moving away from it (i.e., a lower received frequency), and to determine the component of velocity along the line of sight, known as the radial velocity [8].
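The Doppler relation can be made concrete with the standard two-way formula f_d = 2 v_r F_t / c; the X-band carrier frequency in the test below is an illustrative assumption:

```python
# Sketch of the Doppler relation described above: the shift between
# received and transmitted frequency maps to radial velocity.
C = 3.0e8  # speed of light in m/s (rounded)

def doppler_shift(v_radial, f_carrier):
    """Two-way Doppler shift in Hz for a target closing at v_radial m/s
    (positive = approaching, giving a higher received frequency)."""
    return 2 * v_radial * f_carrier / C

def radial_velocity(f_shift, f_carrier):
    """Invert the relation: radial velocity in m/s from a measured shift."""
    return C * f_shift / (2 * f_carrier)
```

At a 9.4 GHz carrier, a vessel closing at 10 m/s produces a shift of roughly 627 Hz, small relative to the carrier but directly measurable.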
Bearing estimation in radar systems refers to the process of determining the azimuth angle of a target relative to the radar’s reference direction, such as true North or the direction of the ship. This is achieved by measuring the angle in the horizontal plane when the radar beam intercepts the target. The accuracy of bearing measurement is influenced by factors such as antenna beam width, signal processing resolution, and the presence of clutter or interference. Mechanically rotating antennas, which are standard in maritime applications, achieve narrow beam widths through large physical apertures, but are limited to rotation rates of 20–40 RPM to allow sufficient dwell time on each bearing for an adequate signal-to-noise ratio. This yields frame rates of 0.3–0.7 Hz for 360° coverage or 1–2 Hz for sector scanning covering limited angular ranges [18].
For predictive modeling, these fundamental radar-signal processing methods impose advantages and challenges simultaneously. On the positive side, pulse compression offers high range resolution that supports detailed spatial representations of vessels and coastlines, enabling predictive models to learn from rich patterns. Moreover, Doppler estimation supplies explicit velocity data, allowing models to integrate motion vectors rather than calculating speed from position changes, and bearing estimation facilitates convolutional or patch-token approaches by imposing a regular spatial grid structure, typical in vision-transformer frameworks.
However, several constraints apply. For example, the mechanical scanning of an X-band radar requires three seconds to complete a full 360° rotation, resulting in a sensor update rate of approximately 0.3 Hz [19]. This relatively low frame rate creates significant temporal gaps between observations, which can disrupt continuous motion tracking and pose a challenge for smooth motion modeling. Furthermore, pulse-compression side lobes and Doppler blind speeds introduce ambiguous or missing features that predictive models must learn to handle. These challenges mean that model architectures must be tailored explicitly for radar data rather than directly imported from natural-image or video domains.

2.2. Transformer Architectures for Prediction

Transformers [12], originally developed for natural language processing, have revolutionized sequence modeling through self-attention mechanisms that capture long-range dependencies without recurrence. Given an input sequence X, the self-attention computes
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
where Q, K, V are learned projections named queries, keys, and values, respectively, and d_k is the key dimension [12]. This enables each element to attend to all others, learning complex spatiotemporal relationships.
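The attention formula above can be written out directly; the following is a minimal single-head NumPy sketch with illustrative shapes:

```python
import numpy as np

# Minimal sketch of scaled dot-product attention for one head:
# softmax(Q K^T / sqrt(d_k)) V, as defined in the text.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights                     # each output attends to all inputs
```

Because every query row is normalized over all keys, each output token is a convex combination of all value vectors, which is exactly the long-range dependency property the text highlights.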
Vision transformers (ViT) [20] adapt this modeling technique to images via patch tokenization, dividing images into non-overlapping patches and projecting them to embedding vectors. For video or sequence prediction, temporal transformers process sequences of frame embeddings, while spatiotemporal variants combine spatial and temporal attention [21]. What makes transformers particularly well suited for radar frame prediction is the ability of self-attention to capture long-range spatiotemporal dependencies across several past frames, enabling more accurate forecasts over extended horizons. In addition, processing all temporal tokens in parallel during training removes the sequential limitations of recurrent networks and makes it easier to use long radar histories and larger batch sizes effectively. Finally, the demonstrated success of transformers in sonar frame prediction in the EchoPT paper [15] suggests a natural path to the radar sensing modality.

2.3. Computational Challenges for Transformer Architectures

Applying transformer architectures to maritime radar data presents significant computational and structural challenges that must be addressed before effective frame prediction becomes feasible. Maritime radar operates in a native polar coordinate system with multi-dimensional signal structure [8,18], fundamentally different from the Cartesian image domains for which vision transformers were originally designed [20]. X-band radar images typically have resolutions of 1024 × 1024 to 2048 × 2048 pixels [6,19], which when divided into standard 16 × 16 patches (as in ViT [20]) produces 4096 to 16,384 tokens per frame. When processing sequences of 10–20 radar frames for temporal prediction, the total token count reaches 40,000 to 300,000, making the quadratic complexity of standard self-attention (O(N²)) [12] computationally prohibitive.
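The token counts quoted above follow directly from the patch geometry; this back-of-the-envelope sketch reproduces them under the stated ViT-style assumptions (square frames, non-overlapping 16 × 16 patches):

```python
# Token-count arithmetic behind the figures in the text, assuming square
# frames divided into non-overlapping P x P patches as in ViT.
def tokens_per_frame(resolution, patch=16):
    """Number of patch tokens for a resolution x resolution frame."""
    return (resolution // patch) ** 2

def sequence_tokens(resolution, num_frames, patch=16):
    """Total tokens for a sequence of num_frames frames."""
    return tokens_per_frame(resolution, patch) * num_frames

def attention_entries(n_tokens):
    """Size of the full pairwise attention-score matrix: the O(N^2) cost."""
    return n_tokens ** 2
```

A 1024-pixel frame yields 4096 tokens, so 10 frames give about 41k tokens, while a 2048-pixel frame over 20 frames gives roughly 328k tokens and a score matrix with over 10^11 entries, which is why efficient attention is a prerequisite.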
To address these challenges, radar-specific tokenization and attention mechanisms must be developed. Drawing from vision transformers [20] and acoustic sensing work [15], potential strategies include (1) patch tokens with bearing-aware positional embeddings [18], (2) explicit Doppler encoding through separate channels [8], (3) physics-informed encodings respecting ambiguity function constraints [8,16,17], and (4) hybrid spatial–velocity representations inspired by EchoPT [15]. The rapid growth of token counts requires efficient attention mechanisms such as sparse attention [22], deformable attention [23], or hierarchical tokenization [24] that exploit polar geometry and range-dependent resolution. Furthermore, radar’s physical ambiguity functions enable radar-informed inductive biases that restrict attention to physically plausible range–Doppler–bearing neighborhoods, reducing computational complexity while incorporating domain knowledge.
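One of the options listed above, restricting attention to physically plausible neighborhoods on the polar grid, can be sketched as a boolean attention mask. The grid sizes and window widths below are illustrative assumptions, not values from the survey:

```python
import numpy as np

# Sketch of a sparse attention mask over a polar (range x bearing) grid:
# a cell may only attend to cells within a small range and bearing window,
# with bearings wrapping around at 360 degrees.
def polar_attention_mask(n_range, n_bearing, range_win, bearing_win):
    ranges, bearings = np.meshgrid(
        np.arange(n_range), np.arange(n_bearing), indexing="ij")
    r = ranges.ravel()
    b = bearings.ravel()
    dr = np.abs(r[:, None] - r[None, :])       # range-bin distance between cells
    db = np.abs(b[:, None] - b[None, :])
    db = np.minimum(db, n_bearing - db)        # wrap-around bearing distance
    return (dr <= range_win) & (db <= bearing_win)  # True = attention allowed
```

Masking the score matrix this way replaces the dense N² interaction with a band whose width can be chosen from how far a target can plausibly move between scans.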

2.4. Advantages of Predicting Radar Frames

Existing maritime transformer models forecast discrete vessel positions from AIS or tracking data, achieving impressive long-horizon accuracy [10,25]. However, these coordinate-based models operate on sparse data that discard the rich spatial information present in radar observations, as shown in Figure 2. Alternatively, forecasting entire future radar images provides critical perception capabilities that trajectory methods cannot. First, frame prediction preserves complete spatial context by generating 2D images that encode not only vessel positions but also coastlines, navigation aids, buoys, and small boats that are exempt from carrying AIS. This enhances reasoning about safe passages and multi-vessel configurations, which is essential for collision avoidance in crowded waterways.
Second, unlike trajectory methods that require vessels to be detected and tracked before prediction, frame prediction operates at the sensor observation level, providing advance warning in congested environments where new objects continuously appear. Moreover, as vessels maneuver, their radar returns change with viewing angle [8], so the frame prediction naturally captures these dynamics, supporting robust object tracking and classification even through maneuvers.
Finally, when detection fails due to heavy clutter or sensing-equipment issues, trajectory methods lose their predictive capability, whereas frame-prediction methods can maintain situational awareness by generating expected radar observations. Computing the difference between the expected frame and the observed frame therefore enables early detection of sensing-equipment anomalies. In addition, frame prediction supports collision avoidance over short horizons, while trajectory prediction methods can forecast minutes to hours ahead, making them more suitable for strategic route planning. This complementarity shows that frame prediction addresses fundamental capabilities, essential for robust maritime autonomy, that coordinate-based methods alone cannot provide.
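The predicted-vs-observed comparison described above can be sketched as a simple per-frame residual score; the threshold value here is an illustrative assumption:

```python
import numpy as np

# Sketch of frame-level anomaly detection: score a radar frame by the mean
# squared residual between the frame the predictor expected and the frame
# the sensor actually delivered.
def anomaly_score(predicted, observed):
    residual = observed.astype(float) - predicted.astype(float)
    return float(np.mean(residual ** 2))

def is_anomalous(predicted, observed, threshold=0.01):
    """Flag the frame when the residual energy exceeds a chosen threshold."""
    return anomaly_score(predicted, observed) > threshold
```

A persistent high score localized to one bearing sector, for instance, could indicate a sensor fault rather than a scene change, which is the early-warning use case described in the text.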

3. Literature Review

3.1. Traditional and Machine Learning Approaches in Maritime Applications

3.1.1. Traditional Methods

Early maritime prediction focused primarily on vessel trajectory forecasting and relied on hand-crafted, physics-based motion models with Gaussian noise assumptions. Perera and Guedes Soares [27] design a kinematic ship model and apply the Extended Kalman Filter to fuse radar or AIS data, enabling real-time state estimation and short-term trajectory prediction for collision avoidance. Moreover, Rong et al. [28] train a Gaussian model on AIS tracks to produce probabilistic forecasts of future positions using mean and variance. These models are lightweight, interpretable, and easy to integrate into existing navigation systems; however, their simplified dynamics cause them to perform poorly under abrupt maneuvers, multi-vessel interaction, and complex maritime environments.

3.1.2. Machine Learning (ML) Methods

Predictive models based on machine learning have relied extensively on AIS data. Liu et al. propose an ACDE-SVR model that combines adaptive chaos differential evolution with support vector regression to forecast vessel position, speed, and heading from short AIS histories, achieving higher and more stable accuracy at low computational cost [29]. Furthermore, Zhang et al. build a random forest model on large-scale AIS data to predict vessel destinations by measuring the similarity between current and historical trajectories, reporting high classification accuracy in busy waterways [30]. Even though these models capture nonlinear patterns more accurately than traditional methods, they still rely on manually engineered features, a limitation later addressed by deep learning.

3.2. Deep Learning (DL) Approaches in Maritime Applications

3.2.1. CNN for Target Detection

Convolutional Neural Networks (CNN) excel at spatial feature extraction from radar imagery. For example, Chen et al. [31] fuse multi-domain features via CNNs for small target detection, and Qu et al. [9] employ an attention-enhanced CNN to capture and learn the deep features of Wigner–Ville distribution of the radar signals. However, pure CNNs lack temporal modeling since they process single frames without capturing motion dynamics that are necessary for prediction.

3.2.2. RNNs for Trajectory Prediction

Recurrent neural networks (RNNs) model temporal dependencies, making them suitable for sequence prediction. Wan et al. [32] propose a Bi-LSTM using instantaneous phase, Doppler spectrum, and STFT features for sea-clutter target detection, with the sequence-feature detector achieving an average detection probability of 0.955. For AIS trajectory prediction, Li et al. [10] review 28 DL approaches, predominantly Long Short-Term Memory (LSTM) variants, which have a high success rate in path forecasting. However, for frame-level prediction, RNNs struggle with long-range dependencies and have difficulty capturing spatial relationships within 2D frames.

3.3. Transformer-Based Approaches for Next-Frame Prediction

3.3.1. AIS Trajectory Prediction

AIS data provides precise vessel coordinates, enabling trajectory forecasting as a sequence-to-sequence problem. Zhou et al. [25] have proposed a transformer specifically for vessel trajectory prediction from AIS sequences. In their architecture, they have incorporated latitude, longitude, speed, and course in positional encoding, multi-head self-attention (8 heads) capturing vessel interaction patterns, temporal encoding via sinusoidal functions, and a decoder generating future positions autoregressively. They achieve 10 min ADE (Average Displacement Error) of 145 m vs. 267 m for LSTM baselines, showing a 45% improvement. The model handles up to 60 min predictions but degrades beyond 30 minutes due to accumulating errors. Its key limitation is that it operates on sparse coordinate sequences, not dense spatial 2D radar imagery. Furthermore, in their survey, Xie et al. [13] have analyzed 16 transformer variants for maritime tasks and have observed this common pattern: transformers consistently outperform RNNs for long-horizon trajectory prediction (>15 min) due to superior long-range dependency modeling; however, they all work with coordinate sequences rather than raw sensor imagery.

3.3.2. Object Detection with Transformers

While transformers have not yet been applied to radar frame prediction, they appear in target detection pipelines. He et al. [11] integrate transformer blocks into YOLOv5 for improved object detection. Their transformer mechanism patches multi-frame radar images into tokens, applies self-attention to learn inter-frame relationships, and enhances the features fed to the YOLO detector. Their architecture uses six transformer layers with eight attention heads, and their results show that transformers can process radar imagery effectively. They are able to detect objects in these radar frames as they move across frames; however, their system outputs bounding boxes encapsulating the objects. Extending transformers to full radar frame prediction remains unexplored.

3.3.3. Sonar Frame Prediction—EchoPT

EchoPT is the most relevant prior work demonstrating the feasibility of frame-level prediction for acoustic imaging. Steckel et al. [15] propose a transformer-based model capable of predicting 2D sonar images from historical frames and ego-motion. Individual sonar frames are divided into P × P patches (with P = 16), linearized, projected into 768-dimensional embeddings, augmented with learned positional encodings, and concatenated with the separately embedded ego-motion of the robot (velocity and angular rate). A transformer encoder of 12 layers and 12 heads processes the sequence of past sonar frames (F_{t−k}, …, F_{t−1}), and a linear decoder maps the output tokens back to pixel space. The model is pre-trained on large-scale mobile-robot sonar data using a combination of MSE (Mean Squared Error) and perceptual losses, achieving high structural similarity for short prediction horizons and gradually degrading as error accumulates over longer horizons. EchoPT thus shows that patch tokenization, explicit ego-motion conditioning, and perceptual loss terms are effective for sparse 2D sensing; however, it operates on dense, short-range sonar images at higher frame rates. In contrast, maritime radar is slower and longer-range, with different clutter dynamics, so adapting this architecture to radar frames requires handling larger temporal gaps, sea clutter, and larger-scale spatial structure.
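The tokenization scheme described for EchoPT can be sketched as follows. The random projection matrices stand in for learned weights, and the three-element ego-motion vector is an illustrative assumption; only the patch size (16) and embedding dimension (768) follow the paper:

```python
import numpy as np

# Hedged sketch of EchoPT-style tokenization: cut a frame into P x P
# patches, project each to a D-dim embedding, and append the embedded
# ego-motion as one extra token.
def tokenize_frame(frame, ego_motion, patch=16, dim=768, seed=0):
    rng = np.random.default_rng(seed)
    h, w = frame.shape
    patches = (frame.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))        # (N, P*P) flattened patches
    w_patch = rng.standard_normal((patch * patch, dim))  # stand-in for learned projection
    w_ego = rng.standard_normal((ego_motion.size, dim))  # stand-in for ego-motion embedding
    tokens = patches @ w_patch                           # (N, D) patch tokens
    ego_token = ego_motion @ w_ego                       # (D,) ego-motion token
    return np.vstack([tokens, ego_token])                # (N + 1, D) transformer input
```

For maritime radar, the ego-motion token would carry own-ship velocity and yaw rate, letting the model separate ego-induced scene motion from true target motion.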

3.4. Video Prediction Models

Transformer-based video prediction in computer vision offers useful templates for radar frame forecasting. VideoGPT [33] models videos in two stages: a 3D-convolutional VQ-VAE first compresses the video into discrete spatiotemporal latents, and a GPT-style transformer then autoregressively predicts future latent tokens with positional encodings, achieving competitive Fréchet Video Distance on datasets like BAIR and UCF-101. Alternatively, MaskViT [34] pre-trains transformers by masked visual modeling, using separate spatial and spatiotemporal window attention and iterative token refinement to generate high-resolution (256 × 256) future frames more efficiently than prior models, while supporting goal-conditioned prediction. Moreover, VPTR [35] introduces an efficient local spatiotemporal attention block and compares fully autoregressive, partially autoregressive, and non-autoregressive transformer variants, showing performance competitive with ConvLSTM baselines on standard video prediction benchmarks.
While video prediction transformers [33,34,35] provide valuable architectural templates, applying them to maritime radar requires addressing fundamental domain differences before the core techniques can be effectively transferred. Radar operates in polar coordinates with range-dependent resolution, unlike Cartesian video frames, and sea clutter exhibits K-distribution and log-normal statistics [8] rather than natural image textures. Furthermore, radar returns are sparse with high dynamic range [7,8], demanding specialized normalization and loss functions. In addition, maritime radar’s 1–2 Hz update rate [19] is an order of magnitude slower than video (24–30 Hz) [33,35], thereby changing temporal modeling requirements.
Despite these differences, three core principles from video prediction remain directly applicable with radar-specific adaptations. First, latent tokenization as demonstrated in VideoGPT allows transformers to model large, noisy radar frames in a compact representation, reducing the computational burden from the high token counts discussed in Section 2.3 while focusing on high-level spatial structure rather than pixel-level details. Second, efficient attention mechanisms as explored in VPTR [35] enable scalable modeling of long radar sequences by restricting attention to local spatiotemporal neighborhoods. Third, masked pre-training as demonstrated in MaskViT [34] offers a path to exploit large volumes of unlabeled radar data from the MOANA dataset and operational radar streams, improving sample efficiency by learning generic clutter and coastline structure before task-specific supervised training. These adaptations establish a concrete pathway from proven video prediction architectures to maritime radar frame forecasting.
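The masked pre-training principle above can be sketched in a few lines: hide a fraction of the patch tokens and score reconstruction only on the hidden ones. The mask ratio, shapes, and the use of zeros as a mask token are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of masked token modeling: hide a random subset of patch
# tokens, then evaluate reconstruction loss only on the hidden positions.
def mask_tokens(tokens, ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_mask = int(n * ratio)
    idx = rng.permutation(n)[:n_mask]  # positions to hide
    masked = tokens.copy()
    masked[idx] = 0.0                  # zeros stand in for a learned [MASK] token
    return masked, idx

def masked_mse(reconstruction, target, idx):
    """Loss computed only where tokens were hidden, as in masked modeling."""
    diff = reconstruction[idx] - target[idx]
    return float(np.mean(diff ** 2))
```

Training a model to fill in the hidden tokens forces it to learn generic clutter and coastline structure from unlabeled radar streams before any supervised fine-tuning, which is the sample-efficiency argument made in the text.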

4. Discussion and Future Directions

Table 1 provides a structured representation of the key methods covered in this survey, organized by perception modality, tasks such as trajectory prediction, detection and tracking, and the architectures used to facilitate these tasks. For predictive approaches, the forecasting horizon is also given in a separate column. From the table, it is apparent that AIS is used almost exclusively for long-horizon trajectory or destination prediction, while radar is used for real-time object detection and tracking, with only one traditional EKF approach combining both. However, even though transformers have already been adopted for maritime tasks, there is, to date, no existing work on transformer-based prediction of future maritime radar frames, beyond related formulations explored for optical video and sonar imagery [15].
Existing studies only partially address radar-specific challenges such as low frame rates, dynamic and state-dependent sea clutter, and polar sampling with range-dependent resolution. In addition, public datasets suitable for prediction are scarce, and most available maritime radar data focus on single-frame object detection rather than multi-frame sequences with reliable ground truth. Evaluation practice is also fragmented, since frame prediction is typically assessed with generic image metrics, while coordinate-based methods report trajectory errors that cannot be directly compared to frame-level metrics.
Despite these limitations, frame-level prediction offers capabilities that coordinate-only approaches cannot provide. Predicting future radar scenes would enable systems to reason about vessels that are not yet confidently tracked, anticipate changes as vessels maneuver, and model the evolution of clutter and other environmental conditions. Such capabilities are directly relevant to collision avoidance, anomaly detection, and anticipatory perception under degraded sensing. Additionally, the availability of the MOANA dataset [19] is a crucial enabler in this context, since it provides large-scale, multi-band maritime radar sequences with associated multimodal information, making it possible to train and systematically evaluate data-intensive deep models for frame prediction under diverse conditions and radar configurations.
Within this context, several focused research directions emerge to bridge the identified gap. Adapting transformers to maritime radar requires physics-aware designs that embed radar knowledge in the model. For example, attention masks can follow range–Doppler ambiguity, positional encodings can reflect the polar (range–bearing) layout of radar, and a Doppler/velocity prediction branch with suitable losses can use motion data while keeping predictions consistent with vessel dynamics. Beyond single-modality approaches, hybrid methods combining trajectory prediction with frame forecasting offer complementary strengths, using cross-attention to condition pixel-level predictions on detected vessel coordinates from AIS or tracking systems and jointly optimizing both representations for complementary long-horizon planning and short-horizon spatial awareness.
For practical deployment, hierarchical tokenization strategies using range-dependent spatial resolution [24], sparse attention patterns following polar geometry [22,23], and knowledge distillation techniques are essential to achieve real-time performance on shipboard computing platforms. Moreover, self-supervised pre-training on MOANA [19] and operational radar streams provides a path to exploiting large volumes of unlabeled data, learning generic clutter and coastline structure before task-specific supervised training. Together, these directions transform the conceptual potential of transformer-based radar frame prediction into a technically achievable implementation pathway.
As we have argued before, radar data, by virtue of its predominantly specular reflection regime, is physically a very different sensing modality from optical techniques. Indeed, the relatively long wavelengths render most surfaces effectively flat relative to the wavelength, promoting specular reflections over diffuse ones. Furthermore, the ability to directly sense the radial velocity component of detected objects differentiates it even further from camera- or LiDAR-based techniques. Therefore, we advocate that machine learning models for radar (and sonar, for that matter) take these physical properties explicitly into account. There are several recent examples where physically informed machine-learning models show great benefit in radar [36] and sonar [37]. The physical properties to be taken into account include the range/Doppler/bearing ambiguity functions. Indeed, there exists a trade-off between the resolutions in range, bearing, and radial velocity through a concept very similar to the Heisenberg uncertainty principle [38]. Additionally, as radar data is most naturally represented in a spherical coordinate system, vessel motion translates nonlinearly into this coordinate system. This has been shown extensively for sonar sensing [39,40,41], but we hypothesize that it is of equal importance for maritime radar data.

5. Conclusions

This survey has examined the landscape of predictive modeling for maritime perception, revealing a clear research gap: while transformers have demonstrated success in AIS trajectory forecasting and sonar frame prediction, their application to maritime radar frame-level prediction remains unexplored. Through comparative analysis of traditional signal processing, machine learning, and deep learning approaches across multiple maritime sensing modalities, we have established that frame-level prediction offers capabilities beyond coordinate-based methods, including spatial context preservation, untracked object anticipation, and environmental evolution modeling. We also argued that radar-specific challenges require architectural adaptations beyond the direct application of video or sonar models. Finally, the convergence of large-scale datasets (MOANA), proven transformer architectures (EchoPT for sonar), and efficient attention mechanisms creates favorable conditions for progress; success, however, requires addressing tokenization strategies, evaluation metrics, physics-informed architectures, and computational efficiency for real-time deployment.
Broader implications extend beyond technical contributions. Robust radar frame prediction can enhance maritime safety by providing anticipatory perception under degraded sensing conditions, support autonomous navigation in congested waterways without relying only on cooperative AIS broadcasts, and enable anomaly detection through predicted-observed discrepancies. As maritime autonomy transitions from research to deployment, the identified research agenda provides a concrete, technically achievable path toward more capable and resilient perception systems.
Future work should prioritize: (1) developing the first transformer-based radar frame prediction baseline on calibrated datasets, (2) establishing standardized evaluation protocols combining pixel-level and task-oriented metrics, (3) exploring physics-informed architectures that exploit radar signal structure, and (4) demonstrating real-time capability on vessel hardware. These steps will transform the identified research gap into validated solutions advancing the state of maritime autonomous perception.
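As a hedged illustration of point (2), a combined evaluation protocol could pair a pixel-level score with a simple task-oriented score. The threshold-based echo mask below is an assumed stand-in for a real detector and serves only to sketch the idea; threshold and noise levels are arbitrary.

```python
import numpy as np

def pixel_mse(pred, obs):
    """Pixel-level score: mean squared error between predicted and observed frames."""
    return float(np.mean((pred - obs) ** 2))

def detection_iou(pred, obs, thresh=0.5):
    """Task-oriented score: IoU of above-threshold echo masks (threshold assumed)."""
    p, o = pred > thresh, obs > thresh
    union = np.logical_or(p, o).sum()
    if union == 0:
        return 1.0  # both frames empty of echoes: perfect agreement
    return float(np.logical_and(p, o).sum()) / float(union)

# Synthetic example: a "prediction" that is the observation plus mild noise.
rng = np.random.default_rng(0)
obs = rng.random((256, 256))
pred = np.clip(obs + 0.05 * rng.standard_normal(obs.shape), 0.0, 1.0)
print(pixel_mse(pred, obs), detection_iou(pred, obs))
```

Reporting both numbers guards against models that optimize pixel fidelity while losing the small, operationally critical echoes.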

Funding

This project was partially supported by the Flanders Make ASORE project.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADE: Average Displacement Error
AIS: Automatic Identification System
CNN: Convolutional Neural Network
DL: Deep Learning
EKF: Extended Kalman Filter
GP: Gaussian Process
LSTM: Long Short-Term Memory
ML: Machine Learning
MSE: Mean Squared Error
RF: Random Forest
RNN: Recurrent Neural Network
RPM: Revolutions Per Minute
STFT: Short-Time Fourier Transform
ViT: Vision Transformer

References

  1. Qiao, Y.; Yin, J.; Wang, W.; Duarte, F.; Yang, J.; Ratti, C. Survey of Deep Learning for Autonomous Surface Vehicles in Marine Environments. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3678–3701. [Google Scholar] [CrossRef]
  2. Thombre, S.; Zhao, Z.; Ramm-Schmidt, H.; Vallet García, J.M.; Malkamäki, T.; Nikolskiy, S.; Hammarberg, T.; Nuortie, H.; Bhuiyan, M.Z.H.; Särkkä, S.; et al. Sensors and AI Techniques for Situational Awareness in Autonomous Ships: A Review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 64–83. [Google Scholar] [CrossRef]
  3. European Maritime Safety Agency. Annual Overview of Marine Casualties and Incidents 2024; European Maritime Safety Agency (EMSA): Lisbon, Portugal, 2024. [Google Scholar]
  4. United Nations Conference on Trade and Development. Review of Maritime Transport 2024; United Nations: New York, NY, USA, 2024. [Google Scholar]
  5. Ersü, C.; Petlenkov, E.; Janson, K. A Systematic Review of Cutting-Edge Radar Technologies: Applications for Unmanned Ground Vehicles (UGVs). Sensors 2024, 24, 7807. [Google Scholar] [CrossRef]
  6. Neill, S.P.; Hashemi, M.R. In Situ and Remote Methods for Resource Characterization. In Fundamentals of Ocean Renewable Energy; Academic Press: Cambridge, MA, USA, 2018; Section 7.3.1, X-Band Radar. [Google Scholar]
  7. Zhu, X.X.; Montazeri, S.; Ali, M.; Hua, Y.; Wang, Y.; Mou, L.; Shi, Y.; Xu, F.; Bamler, R. Deep Learning Meets SAR: Concepts, models, pitfalls, and perspectives. IEEE Geosci. Remote Sens. Mag. 2021, 9, 143–172. [Google Scholar] [CrossRef]
  8. Richards, M.A. Fundamentals of Radar Signal Processing, 3rd ed.; McGraw-Hill Education: Singapore, 2022; Chapters 2 and 4. [Google Scholar]
  9. Qu, Q.; Liu, W.; Wang, J.; Li, B.; Liu, N.; Wang, Y.-L. Enhanced CNN-Based Small Target Detection in Sea Clutter with Controllable False Alarm. IEEE Sens. J. 2023, 23, 10193–10205. [Google Scholar] [CrossRef]
  10. Li, H.; Jiao, H.; Yang, Z. AIS data-driven ship trajectory prediction modelling and analysis based on machine learning and deep learning methods. Transp. Res. Part E Logist. Transp. Rev. 2023, 175, 103152. [Google Scholar] [CrossRef]
  11. He, X.; Chen, X.; Du, X.; Wang, X.; Xu, S.; Guan, J. Maritime Target Radar Detection and Tracking via DTNet Transfer Learning Using Multi-Frame Images. Remote Sens. 2025, 17, 836. [Google Scholar] [CrossRef]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  13. Xie, Z.; Tu, E.; Fu, X.; Yuan, G.; Han, Y. AIS Data-Driven Maritime Monitoring Based on Transformer: A Comprehensive Review. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar] [CrossRef]
  14. Geng, Z.; Yan, H.; Zhang, J.; Zhu, D. Deep-Learning for Radar: A Survey. IEEE Access 2021, 9, 141800–141818. [Google Scholar] [CrossRef]
  15. Steckel, J.; Jansen, W.; Huebel, N. EchoPT: A Pretrained Transformer Architecture That Predicts 2D In-Air Sonar Images for Mobile Robotics. Biomimetics 2024, 9, 695. [Google Scholar] [CrossRef] [PubMed]
  16. Van Trees, H.L. Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory; John Wiley & Sons: New York, NY, USA, 1968. [Google Scholar]
  17. Van Trees, H.L. Detection, Estimation, and Modulation Theory, Part II: Nonlinear Modulation Theory; John Wiley & Sons: New York, NY, USA, 1971. [Google Scholar]
  18. Bole, A.; Wall, A.; Norris, A. Radar and ARPA Manual: Radar, AIS and Target Tracking for Marine Radar Users, 3rd ed.; Butterworth-Heinemann: Oxford, UK, 2014. [Google Scholar]
  19. Jang, H.; Yang, W.; Kim, H.; Lee, D.; Kim, Y.; Park, J.; Jeon, M.; Koh, J.; Kang, Y.; Jung, M.; et al. MOANA: Multi-Radar Dataset for Maritime Odometry and Autonomous Navigation Application. arXiv 2024, arXiv:2412.03887. [Google Scholar] [CrossRef]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  21. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021. [Google Scholar]
  22. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar] [CrossRef]
  23. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021. [Google Scholar]
  25. Nguyen, D.; Fablet, R. A Transformer Network with Sparse Augmented Data Representation and Cross Entropy Loss for AIS-Based Vessel Trajectory Prediction. IEEE Access 2024, 12, 21596–21609. [Google Scholar] [CrossRef]
  26. Clipper. Ais dcu Bridge. Wikimedia Commons. 2006. Available online: https://commons.wikimedia.org/wiki/File:Ais_dcu_bridge.jpg (accessed on 10 December 2025).
  27. Perera, L.P.; Guedes Soares, C. Ocean Vessel Trajectory Estimation and Prediction Based on Extended Kalman Filter. In Proceedings of the Second International Conference on Adaptive and Self-Adaptive Systems and Applications (ADAPTIVE 2010), Lisbon, Portugal, 21–26 November 2010; pp. 14–20. [Google Scholar]
  28. Rong, H.; Teixeira, A.P.; Guedes Soares, C. Ship trajectory uncertainty prediction based on a Gaussian Process model. Ocean Eng. 2019, 182, 499–511. [Google Scholar] [CrossRef]
  29. Liu, J.; Shi, G.; Zhu, K. Vessel Trajectory Prediction Model Based on AIS Sensor Data and Adaptive Chaos Differential Evolution Support Vector Regression (ACDE-SVR). Appl. Sci. 2019, 9, 2983. [Google Scholar] [CrossRef]
  30. Zhang, C.; Bin, J.; Wang, W.; Peng, X.; Wang, R.; Halldearn, R.; Liu, Z. AIS data driven general vessel destination prediction: A random forest based approach. Transp. Res. Part C Emerg. Technol. 2020, 118, 102729. [Google Scholar] [CrossRef]
  31. Chen, X.; Su, N.; Huang, Y.; Guan, J. False-Alarm-Controllable Radar Detection for Marine Target Based on Multi Features Fusion via CNNs. IEEE Sens. J. 2021, 21, 9099–9111. [Google Scholar] [CrossRef]
  32. Wan, H.; Tian, X.; Liang, J.; Shen, X. Sequence-Feature Detection of Small Targets in Sea Clutter Based on Bi-LSTM. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  33. Yan, W.; Zhang, Y.; Abbeel, P.; Srinivas, A. VideoGPT: Video Generation Using VQ-VAE and Transformers. arXiv 2021, arXiv:2104.10157. [Google Scholar]
  34. Gupta, A.; Tian, S.; Zhang, Y.; Wu, J.; Martín-Martín, R.; Fei-Fei, L. MaskViT: Masked Visual Pre-Training for Video Prediction. In Proceedings of the International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  35. Ye, X.; Bilodeau, G.-A. Video Prediction by Efficient Transformers. Image Vis. Comput. 2023, 133, 104612. [Google Scholar] [CrossRef]
  36. Zheng, R.; Sun, S.; Caesar, H.; Chen, H.; Li, J. Redefining Automotive Radar Imaging: A Domain-Informed 1D Deep Learning Approach for High-Resolution and Efficient Performance. arXiv 2024, arXiv:2406.07399. [Google Scholar] [CrossRef]
  37. Jansen, W.; Steckel, J. SonoNERFs: Neural Radiance Fields Applied to Biological Echolocation Systems Allow 3D Scene Reconstruction through Perceptual Prediction. Biomimetics 2024, 9, 321. [Google Scholar] [CrossRef]
  38. Klemm, R. Principles of Space-Time Adaptive Processing, 3rd ed.; The Institution of Engineering and Technology: London, UK, 2006. [Google Scholar] [CrossRef]
  39. Peremans, H.; Steckel, J. Acoustic flow for robot motion control. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA 2014), Hong Kong, China, 31 May–7 June 2014; pp. 316–321. [Google Scholar] [CrossRef]
  40. Jansen, W.; Laurijssen, D.; Steckel, J. Adaptive Acoustic Flow-Based Navigation with 3D Sonar Sensor Fusion. In Proceedings of the 2021 International Conference on Indoor Positioning and Indoor Navigation (IPIN 2021), Lloret de Mar, Spain, 29 November–2 December 2021; pp. 1–8. [Google Scholar] [CrossRef]
  41. Steckel, J.; Peremans, H. Acoustic Flow-Based Control of a Mobile Platform Using a 3D Sonar Sensor. IEEE Sens. J. 2017, 17, 3131–3141. [Google Scholar] [CrossRef]
Figure 1. Annotated maritime scene showing multiple objects (small vessels, distant ships and navigation buoys and flags) that must be detected and classified by the perception system. Base photograph by Charlotte Clark.
Figure 2. Comparison of AIS and radar observations. (a) AIS display/control unit shows only cooperative vessels that can broadcast AIS messages like identity, voyage data, etc. Image by Clipper, “Ais dcu bridge”, 2006, Wikimedia Commons, licensed under CC BY 2.5 [26]. (b) Example of an X-band radar frame from MOANA dataset [19] containing the full radar scene, including shoreline returns and additional small echoes that may correspond to clutter, navigation aids, or buoys.
Table 1. Representative methods for maritime perception and prediction.
Method | Data | Task | Architecture | Horizon
Perera [27] | Radar/AIS | Trajectory prediction | EKF | ∼min
Rong [28] | AIS | Trajectory prediction | GP | 30–60 min
Liu [29] | AIS | Trajectory prediction | ACDE-SVR | ≤30 min
Zhang [30] | AIS | Destination prediction | RF | Voyage
Chen [31] | Radar | Object detection | CNN | Real-time
Qu [9] | Radar | Object detection | Att.-CNN | Real-time
Wan [32] | Radar | Object detection | Bi-LSTM | Real-time
TrAISformer [25] | AIS | Trajectory prediction | Transformer | 1–3 h
DTNet [11] | Radar | Detection and tracking | CNN + Transf. | Real-time
EchoPT [15] | Sonar | Frame prediction | Transformer | Few steps