1. Introduction
A typhoon is a low-pressure cyclone occurring over tropical or subtropical oceans. It behaves like a vortex in the atmosphere, causing surrounding air to rotate rapidly around its center and move along with the ambient atmospheric flow. Each year in summer and autumn, rising sea surface temperatures lead to evaporation and the formation of a low-pressure center. Air from surrounding high-pressure areas continuously flows toward the low-pressure center and, under the influence of the Coriolis force, forms a tropical cyclone. If the sea surface remains warm, the tropical cyclone continues to intensify, eventually developing into a powerful typhoon. When a typhoon makes landfall, the surrounding high-speed winds can cause injuries, property damage, and the uprooting of trees. Additionally, typhoons are often accompanied by heavy rainfall and secondary hazards such as landslides. According to statistics from the Typhoon Committee under the United Nations Economic and Social Commission for Asia and the Pacific (ESCAP) and the World Meteorological Organization (WMO), typhoons in China alone cause an average of approximately 505 deaths and 5.6 billion USD in economic losses annually [1].
With the rapid development of deep learning, its potential in natural disaster prediction and meteorological forecasting has increasingly been demonstrated. In particular, for typhoon track and intensity prediction, deep neural networks have significantly improved accuracy thanks to the adoption of architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers in recent years. To capture the temporal characteristics of meteorological sequences, RNNs are commonly employed. For instance, Chao et al. (2024) utilized LSTM models with offshore buoy data to enhance forecast accuracy [
2]. Park et al. (2024) proposed the TPTNet model [
3], which treats multi-station spatiotemporal data as spatial climate fields and integrates CNNs, Transformers, and graph neural networks, achieving superior performance in short-term forecasting tasks compared to traditional numerical weather prediction. However, such methods are typically based on point data recorded by meteorological observation stations, resulting in a limited receptive field and difficulties in capturing long-range dependencies and global variations in complex meteorological systems.
To address this issue, many researchers have attempted to combine remote sensing meteorological imagery with station observation data to construct multi-modal temporal inputs, and multi-modal models are then employed for joint modeling. For example, Zhou et al. (2020) proposed the GC-LSTM model [
4], which integrates CNNs and LSTMs to model satellite cloud imagery, improving categorical forecast accuracy. Xu et al. (2022) proposed SAF-Net [
5] and AM-ConvGRU [
6]; SAF-Net employs a wide-and-deep dual-path structure for joint modeling of typhoon features, while AM-ConvGRU introduces residual channel attention and multi-scale convolutions, achieving lower path prediction error. Qin et al. (2022) developed the Trj-DMFMG model [
7], which combines multi-modal fusion and multi-task generative modules to enhance multi-source data modeling capabilities. He et al. (2024) proposed ACFN [
8], incorporating attention mechanisms into convolutional structures to enhance key feature extraction and fusion. Park et al. (2024) developed the Transformer-based LT3P model [
9], which excels in full-time-step forecasting, especially in short-term predictions. Tian et al. (2024) designed the lightweight dual-branch AWL-Net [
10], improving multiple accuracy metrics under low computational cost. Qiao et al. (2024) proposed AT-ResNeXt-50 [
11], integrating self-attention mechanisms to enhance recognition in complex scenarios. Ren et al. (2025) introduced the dual-encoder spatiotemporal fusion model DESF-Typhoon [
12], which significantly outperforms existing methods in both path prediction accuracy and year-round stability. However, meteorological disasters often exhibit broad spatial impacts and significant regional variability, and point-based predictions alone cannot fully support the assessment and mitigation of such disasters.
To simulate regional future meteorological evolution trends, some studies attempt to learn meteorological evolution patterns and generate corresponding future meteorological images. For example, Andrychowicz et al. (2023) proposed MetNet-3 [
13], introducing a key densification mechanism to enhance spatially dense prediction capabilities, showing significant potential in short-term high-resolution forecasting. Gao et al. (2023) proposed the diffusion generative model PreDiff [
14], combining Transformer and U-Net architectures and introducing a physics-guided mechanism to enhance the physical consistency of the model. Hu et al. (2023) proposed SwinVRNN [
15], a variational RNN model based on Swin Transformers, achieving high forecast accuracy while maintaining reasonable ensemble diversity. Ling et al. (2024) proposed the CDDPM model [
16], combining convolution and upsampling convolution structures to improve generative accuracy and perceptual quality. Ren et al. (2024) introduced SAM-Net [
17] by incorporating a self-attention memory module into PredRNN-v2. Lin et al. (2024) proposed the multivariate spatiotemporal hybrid convolution-attention network StHCFormer [
18], excelling in wind evolution feature modeling. Kochkov et al. (2024) developed NeuralGCM [
19], a hybrid model integrating physical modeling and machine learning methods. Li et al. (2025) proposed SwinNowcast [
20], a deep learning model based on the Swin Transformer architecture, achieving low false alarm rates in regional precipitation forecasting. Xu et al. (2025) proposed the Fourier near-term forecasting model FourCastLSTM [
21], significantly improving B-MAE and B-MSE metrics in precipitation nowcasting tasks. While these models enable relatively accurate and comprehensive small-scale regional meteorological image forecasting, they still fall short of providing a full understanding of global climate evolution.
In recent years, with substantial improvements in computational power and the availability of high-quality real-time meteorological datasets, some research teams have begun training large-scale meteorological models using supercomputers. For example, Zhang et al. (2023) proposed NowcastNet [
22], which consists of an evolution network and a generation network, incorporating physics-based evolution mechanisms to improve prediction accuracy. Bi et al. (2023) introduced Pangu-Weather [
23], a Transformer-based model integrating hierarchical prediction and physical priors for fine-grained modeling of multi-level atmospheric variables. Chen et al. (2023) developed FengWu [
24], an advanced data-driven global medium-range weather forecasting system, demonstrating superior performance in 80% of 880 predicted meteorological variables while reducing errors in long-term forecasts. Lam et al. (2023) proposed GraphCast [
25], a graph neural network-based model capable of efficiently predicting multiple meteorological variables over the next 10 days in one minute. Niu et al. (2024) developed Pangu_SP [
26], improving the stability of typhoon track and intensity prediction through a spectral perturbation mechanism. Bodnar et al. (2025) proposed Aurora [
27], a large foundational model trained on one million hours of multi-source geospatial data, outperforming forecasts from seven meteorological centers in high-resolution weather prediction. Despite their excellent performance, these models require substantial computational resources for training and inference, which limits their deployment on most computing platforms.
In summary, although deep learning models have achieved significant progress in typhoon track prediction, local meteorological evolution modeling, and global weather forecasting, existing approaches still face challenges such as insufficient multi-modal feature fusion, limited temporal and spatial evolution modeling, or excessive computational demands. To address these issues, this paper proposes a dual-path, multi-modal typhoon track prediction model that employs a gated axial Transformer to capture deeper structural features and a dual-branch prediction architecture to strengthen temporal dependencies and spatial evolution modeling, thereby improving the model’s predictive performance.
2. Materials and Methods
2.1. Data
This study utilized the Typhoon Best Track dataset released by the China Meteorological Administration (CMA) and the CDAS (Climate Data Assimilation System) data provided by the Climate Forecast System Version 2 (CFSv2) developed by the National Centers for Environmental Prediction (NCEP) in the United States. The CMA dataset, published by the Tropical Cyclone Data Center of the CMA, contains six-hourly typhoon position and intensity data over the Northwest Pacific since 1949. The dataset has been corrected and integrated from various data sources by multiple meteorologists (Ying et al., 2014 [28]; Lu et al., 2021 [29]), and it is widely used as a reference record of tropical cyclones in this basin. The variables contained in the dataset and their descriptions are listed in Table 1. In this study, the best-track position data provided by the CMA dataset were used to determine the typhoon center locations. Pressure and wind speed were selected as input variables, while datetime information and typhoon intensity level were not used. No external vortex detection algorithms were applied in this work.
The CFSv2 provided by NCEP is an integrated atmosphere-ocean-land coupled prediction system, designed to generate high-precision analysis fields and seasonal-to-interannual climate forecasts. Compared with its predecessor CFS, CFSv2 has significantly improved spatial resolution, vertical levels, data assimilation methods, and coupling mechanisms. Its Climate Data Assimilation System (CDAS) automatically integrates multi-source observational data from satellites, ground stations, and ocean buoys worldwide to generate high-quality reanalysis data. The system has been operational four times daily since 2011 (at 0, 6, 12, and 18 UTC), with a spatial resolution of 0.5° for atmospheric variables and 0.25° for oceanic variables (Saha et al., 2014 [
30]). Thanks to its long temporal coverage, rich variable types, and multiple vertical levels, CDAS data from CFSv2 have been widely applied in climate monitoring, model-driven studies, and tropical cyclone trajectory analysis. In this study, the isobaric surface data from the CDAS dataset were used. These data include wind components, temperature, humidity, and circulation-related parameters, which collectively characterize the dynamic and thermodynamic properties of the atmospheric environment influencing typhoon movement. Therefore, these variables were selected as meteorological environmental input features for the model. In addition, these variables are consistently available in the CDAS dataset and have been widely used in previous tropical cyclone studies. The names of the selected variables and their descriptions are summarized in
Table 2.
In this study, tropical cyclone data from 2011 to 2024 were selected as the research subjects. For the CMA dataset, observations at 6 h intervals starting from 00:00 daily were used, i.e., 00:00, 06:00, 12:00, and 18:00 each day. For the CDAS dataset, in order to align the time and coordinates with the CMA dataset, the reanalysis meteorological fields corresponding to the same timestamps (00:00, 06:00, 12:00, and 18:00) in the CDAS were extracted. Based on the typhoon center coordinates at each time step from the CMA dataset, a 24° × 24° region centered on the typhoon location was selected from the CDAS dataset as the input data. Through this approach, the meteorological field data and typhoon track data remain strictly aligned in time, with no additional time lag. The data resolution was 0.5° × 0.5°, and multiple isobaric levels (225 mbar, 500 mbar, and 750 mbar) were used as input feature maps.
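The window extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the grid layout (rows starting at 90° N, columns starting at 0° E), the helper names `crop_indices` and `crop_field`, and the choice of a 48 × 48 cell window for 24° at 0.5° are hypothetical and are not taken from the paper.

```python
def crop_indices(center_lat, center_lon, half_width_deg=12.0, res_deg=0.5):
    """Return (row indices, column indices) of a square window on a global grid.

    Assumed grid layout (hypothetical, not taken from the paper):
    rows run from 90° N southward, columns run from 0° E eastward,
    both at `res_deg` spacing; longitudes wrap around at 360°.
    """
    n_lon = int(round(360.0 / res_deg))
    n_lat = int(round(180.0 / res_deg)) + 1
    size = int(round(2 * half_width_deg / res_deg))  # 48 cells for 24° at 0.5°
    row0 = int(round((90.0 - center_lat) / res_deg)) - size // 2
    col0 = int(round((center_lon % 360.0) / res_deg)) - size // 2
    # Clamp rows at the poles and wrap columns across the 0°/360° meridian.
    rows = [min(max(r, 0), n_lat - 1) for r in range(row0, row0 + size)]
    cols = [c % n_lon for c in range(col0, col0 + size)]
    return rows, cols


def crop_field(field, rows, cols):
    """Extract the window from a 2-D field stored as a list of rows."""
    return [[field[r][c] for c in cols] for r in rows]
```

Applying the same indices to every variable and isobaric level at a given timestamp keeps all feature maps centered on the best-track position for that time step.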
Furthermore, because the number of typhoon samples is limited and models therefore struggle to learn atmospheric evolution features, an additional atmospheric evolution dataset was constructed from randomly selected samples with the same spatial extent, resolution, and isobaric levels as the typhoon feature maps, and was used for model pretraining. The two datasets used in this study are illustrated in Figure 1. In this study, the first five time steps were used as model inputs, and the subsequent four time steps (lead times of 6 h, 12 h, 18 h, and 24 h) were used as prediction targets.
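The five-in, four-out sample layout can be illustrated with a short sketch. The function name `make_windows` and the stride of one time step between consecutive samples are assumptions made for illustration; the paper does not state how windows are sampled from each storm.

```python
def make_windows(track, n_in=5, n_out=4):
    """Split one storm's 6-hourly records into (inputs, targets) pairs.

    Each sample uses `n_in` consecutive records as input and the next
    `n_out` records as targets. A stride of one step is assumed.
    """
    samples = []
    for start in range(len(track) - n_in - n_out + 1):
        inputs = track[start:start + n_in]
        targets = track[start + n_in:start + n_in + n_out]
        samples.append((inputs, targets))
    return samples
```

Under these assumptions a storm with fewer than nine six-hourly records yields no samples, so short-lived systems do not contribute to training.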
In addition, because the CMA and CDAS datasets differ in structure and the CDAS feature maps differ greatly in value range, min–max normalization was applied to all data. In the CMA dataset, tropical cyclone positions are recorded as latitude and longitude coordinates, where the latitude ranges from 10° N to 78° N and the longitude from 102.5° E to 106.3° W. The model predicts changes in latitude and longitude, which are mostly less than 10°, so the range of the raw input coordinates is large relative to the predicted quantities. Therefore, min–max normalization was applied to the CMA inputs.
For the CDAS dataset, the value ranges of different feature maps vary greatly: some feature maps contain values on the order of tens of thousands, while others only have values in the single digits. Such large discrepancies could lead the model to ignore features with smaller values. Therefore, min–max normalization was applied separately to each feature map to ensure all inputs were scaled appropriately.
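A minimal sketch of per-feature-map min–max scaling is given below. The array layout `(samples, channels, H, W)`, the function name, and the small `eps` guard against constant channels are assumptions for illustration. Returning the statistics so they can be reused on other splits is a common practice rather than something the paper specifies.

```python
import numpy as np


def minmax_per_channel(x, eps=1e-12):
    """Scale each channel of x, shaped (samples, channels, H, W), to [0, 1].

    The minimum and maximum are taken over all samples and grid points
    of a channel, so channels with very different value ranges end up
    on a comparable scale.
    """
    lo = x.min(axis=(0, 2, 3), keepdims=True)
    hi = x.max(axis=(0, 2, 3), keepdims=True)
    scaled = (x - lo) / (hi - lo + eps)  # eps avoids division by zero
    return scaled, lo, hi
```

With this scaling, a channel holding values in the tens of thousands and a channel holding single-digit values both span roughly the same interval.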
2.2. Method
With the introduction of CDAS reanalysis data, the diversity of feature maps contained in the dataset increases significantly. To enable the model to perceive and integrate multiple types of meteorological information, we combine a gated axial Transformer with other neural network components to construct a dual-branch model termed the Typhoon-Gated Axial Transformer (TGAT). This model fuses multi-channel data with the aim of improving the prediction accuracy of typhoon tracks over the next 24 h. The overall architecture of the TGAT model is illustrated in Figure 2.
As shown in
Figure 2, the proposed model consists of four main components: the Atmospheric Reanalysis Feature Encoding Module, the Typhoon Core Dynamics Encoding Module, the Typhoon Track Prediction Branch, and the Environmental Field Prediction Branch. The typhoon core dynamics encoding module feeds the input historical track data into an LSTM network and outputs a one-dimensional vector of length 48 that represents the typhoon’s movement tendency. The atmospheric reanalysis feature encoding module is responsible for extracting features from the input CDAS environmental feature maps, and its final output is a feature tensor with a shape of (512, 3, 3).
In the typhoon track prediction branch, the feature maps generated by the atmospheric reanalysis feature encoding module are first flattened and then fed into a Transformer module. The Transformer output is subsequently fused with the output of the typhoon core dynamics encoding module and passed to a decoder composed of an LSTM and fully connected layers to generate the final track prediction. In the environmental field prediction branch, a recurrent module consisting of linear layers and an LSTM first transforms the feature maps from five time steps to four time steps. The transformed features are then spatially reconstructed through convolutional upsampling to generate predicted future meteorological environment maps, enabling the model to be pretrained using meteorological data.
In the following subsections, we describe the LSTM, the reanalysis data encoder branch, the self-attention mechanism and Transformer, the gated axial attention mechanism, and the environmental field prediction branch in turn.
2.2.1. Long Short-Term Memory Network
The historical trajectory data of tropical cyclones is a multi-feature sequential dataset containing hidden temporal information, such as typhoon movement direction and intensity trends. To enable the model to capture temporal dynamics, researchers often employ RNNs or their variants to process such data. In this study, we adopt a Long Short-Term Memory (LSTM) network to process the historical trajectories of tropical cyclones.
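The structure of an LSTM unit is illustrated in Figure 3. Compared with standard recurrent neural networks, its key feature is the presence of a cell state and three gating mechanisms: the input gate, forget gate, and output gate. The cell state is responsible for preserving important information from previous time steps. The forget gate $f_t$ controls the retention of the previous cell state $C_{t-1}$; the input gate $i_t$ regulates the incorporation of the current candidate cell state $\tilde{C}_t$; the current cell state $C_t$ is obtained by combining $C_{t-1}$ and $\tilde{C}_t$; finally, the output gate $o_t$ determines the hidden state $h_t$ at the current time step. The computation formulas are as follows:

$$ f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) $$
$$ i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) $$
$$ \tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) $$
$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$
$$ o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) $$
$$ h_t = o_t \odot \tanh(C_t) $$

where $x_t$ is the input at time step $t$, $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input, $W$ and $b$ are the trainable weights and biases of each gate, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.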
Compared with other recurrent neural networks, LSTM can more effectively handle long-term dependencies. Its unique gating mechanisms enable the network to learn how to selectively retain or discard information in the cell state, thus mitigating issues such as information loss over long sequences and the vanishing gradient problem.
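The gating computations described above can be sketched as a single LSTM time step. This follows the standard LSTM formulation and is not the authors' implementation. Stacking the four weight blocks into one matrix `W` and the function names are choices made here for brevity.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,).
    The four row blocks hold the forget, input, candidate, and output
    parameters, in that order (an arbitrary convention for this sketch).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])    # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])    # input gate
    c_hat = np.tanh(z[2 * hidden:3 * hidden])  # candidate cell state
    o_t = sigmoid(z[3 * hidden:4 * hidden])    # output gate
    c_t = f_t * c_prev + i_t * c_hat           # updated cell state
    h_t = o_t * np.tanh(c_t)                   # updated hidden state
    return h_t, c_t
```

Because the cell state is updated additively through `f_t * c_prev`, information can persist across many steps when the forget gate stays close to one, which is the mechanism behind the mitigation of vanishing gradients mentioned above.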
2.2.2. Reanalysis Data Encoder Branch
The reanalysis data encoder branch consists of a CNN-based feature extraction layer, a global–local gated axial Transformer module, and a Transformer-based feature extraction layer. The CNN feature extraction layer is composed of a single convolutional layer, which performs preliminary feature extraction and adjusts the spatial shape and channel dimensions of the input images to facilitate subsequent processing.
To enable the model to simultaneously capture global and local information, a global–local gated axial Transformer module is employed to process the image data. This module processes the input through two parallel branches, namely a global branch and a local branch, and the final output is obtained by concatenating the outputs of both branches. The two branches are described as follows:
Global branch: This branch directly applies the gated axial Transformer to the entire input image to model long-range dependencies and capture global contextual information.
Local branch: In this branch, the input image is divided into nine non-overlapping patches of size $(I/3) \times (I/3)$, where $I$ denotes the spatial dimension of the original image. A gated axial Transformer is then applied to each patch independently. Finally, the outputs from the nine spatial locations are concatenated to form the output of the local branch.
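The patch handling in the local branch can be sketched as below. The function names are hypothetical, and the sketch assumes that the nine patch outputs are reassembled at their original spatial positions, which the text does not state explicitly.

```python
import numpy as np


def split_into_patches(x, grid=3):
    """Split x of shape (C, H, W) into grid * grid non-overlapping patches.

    H and W must be divisible by `grid`. The result has shape
    (grid * grid, C, H // grid, W // grid), ordered row by row.
    """
    c, h, w = x.shape
    ph, pw = h // grid, w // grid
    patches = x.reshape(c, grid, ph, grid, pw)  # split both spatial axes
    patches = patches.transpose(1, 3, 0, 2, 4)  # (grid, grid, C, ph, pw)
    return patches.reshape(grid * grid, c, ph, pw)


def merge_patches(patches, grid=3):
    """Inverse of split_into_patches: reassemble patches into (C, H, W)."""
    _, c, ph, pw = patches.shape
    x = patches.reshape(grid, grid, c, ph, pw).transpose(2, 0, 3, 1, 4)
    return x.reshape(c, grid * ph, grid * pw)
```

In this layout the per-patch attention module would be applied to each of the nine entries along the first axis before `merge_patches` restores the full map.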
Subsequently, the output of the global–local gated axial Transformer module is flattened into a one-dimensional vector and fed into a multi-layer Transformer module for further feature extraction, yielding the final output of the reanalysis data encoder branch.
2.2.3. Self-Attention Mechanism and Transformer
Self-attention is a key technique used to model dependencies between different positions within an input sequence. It maps the input features into queries $q$ (Query), keys $k$ (Key), and values $v$ (Value) through three learnable linear transformations, which can be formulated as follows:

$$ q_{ij} = W_Q x_{ij}, \quad k_{ij} = W_K x_{ij}, \quad v_{ij} = W_V x_{ij} $$

where $W_Q$, $W_K$, and $W_V$ are trainable parameters, and $q_{ij}$, $k_{ij}$, and $v_{ij}$ denote the query, key, and value at an arbitrary spatial position $(i, j)$ in the image, respectively.
When an input image $x \in \mathbb{R}^{C_{in} \times H \times W}$ with height $H$, width $W$, and $C_{in}$ channels is provided, the self-attention layer computes an output $y \in \mathbb{R}^{C_{out} \times H \times W}$. The self-attention operation is formulated as follows:

$$ y_{ij} = \sum_{h=1}^{H} \sum_{w=1}^{W} \operatorname{softmax}\left(q_{ij}^{\top} k_{hw}\right) v_{hw} $$

As shown in the above formulation, the attention weights are determined by the similarity between $q_{ij}$ and $k_{hw}$, both of which are dynamically generated from the input $x$. This property enables the self-attention mechanism to adaptively adjust its weights according to different inputs, allowing each spatial position in an image to capture global contextual information.
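The attention computation described above can be sketched over the spatial positions of a single image. The function names and projection shapes are illustrative assumptions, and no scaling factor is applied to the logits, matching the plain dot-product similarity described in the text.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_2d(x, W_q, W_k, W_v):
    """Global self-attention over all H * W positions of x, shaped (C_in, H, W).

    W_q and W_k have shape (d, C_in); W_v has shape (C_out, C_in).
    Returns the output of shape (C_out, H, W) and the (HW, HW) weights.
    """
    c_in, h, w = x.shape
    tokens = x.reshape(c_in, h * w)          # one column per spatial position
    q, k, v = W_q @ tokens, W_k @ tokens, W_v @ tokens
    weights = softmax(q.T @ k, axis=-1)      # row o: attention of o over all p
    y = v @ weights.T                        # y_o = sum_p weights[o, p] * v_p
    return y.reshape(-1, h, w), weights
```

The weight matrix has (HW)² entries, which is the quadratic cost that motivates cheaper attention variants for images.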
Building upon the self-attention mechanism, Vaswani et al. (2017) [
31] proposed the Transformer architecture, which stacks multiple self-attention layers and feed-forward neural networks. In particular, the multi-head attention mechanism employs multiple independent attention heads to compute attention in parallel, with each head learning different feature distributions. Owing to the introduction of multi-head attention, Transformers are capable of effectively modeling diverse dependency patterns within sequences and capturing deep contextual relationships. In this study, Transformer modules are also adopted for feature extraction.
2.2.4. Gated Axial Attention Mechanism
With the strong representation capability demonstrated by Transformer architectures, researchers have increasingly explored their application in computer vision tasks. Dosovitskiy et al. (2021) [
32] proposed the Vision Transformer (ViT), which partitions an image into patches, flattens them, and feeds them into a Transformer to extract image features. Studies have shown that when trained on sufficiently large datasets, ViT can outperform traditional convolutional neural networks. However, despite its advantages—such as parallel computation and powerful modeling capacity—ViT suffers from several limitations, including high computational complexity, large memory consumption, and insufficient capability to explicitly capture spatial positional information. Moreover, ViT tends to overfit on small-scale datasets, leading to inferior performance compared with conventional CNNs in data-limited scenarios.
To address the high computational cost of self-attention and its limited ability to capture positional information, Wang et al. proposed an axial attention mechanism with relative positional encoding [
33]. Specifically, axial attention abandons full 2D attention computation at each spatial location and instead decomposes attention along the height and width axes independently.
The gated axial attention module adopted in this study is illustrated in Figure 4. As shown, the module consists of a $1 \times 1$ convolutional layer, a normalization layer, and multiple layers of gated multi-head axial attention operating along horizontal and vertical directions. The $1 \times 1$ convolution is mainly responsible for channel projection and dimensional adjustment, the normalization layer stabilizes the training process, and the stacked gated axial multi-head attention layers efficiently extract long-range dependency features from the image while significantly reducing computational overhead.
In addition, Wang et al. [
33] introduced positional bias terms into the queries (Q), keys (K), and values (V), enabling the attention mechanism to more accurately capture positional information within the sequence. While their approach addresses the issues of high computational cost and the lack of positional awareness in self-attention, their experiments were conducted on relatively large datasets. However, studies have shown that although self-attention mechanisms can effectively capture data features when trained on large-scale datasets, they may fail to realize their advantages on smaller datasets. This limitation is mainly attributed to the difficulty of learning effective positional encodings in small-scale datasets, which leads to reduced accuracy in encoding long-range dependencies.
To address the poor performance of self-attention mechanisms on small-scale datasets, Valanarasu et al. proposed the gated axial attention module [
34], which incorporates multiple learnable gating parameters to control the influence of positional encodings in global context modeling. The structure of the gated attention mechanism is illustrated in
Figure 5.
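Here, $G_Q$, $G_K$, $G_{V1}$, and $G_{V2}$ are learnable gating parameters. They can learn the relative importance of the positional encodings at different locations, assigning higher weights to positions where the positional encoding is more accurate. The computation formula for the axial attention with the gated mechanism, written here for attention along the width axis, is as follows:

$$ y_{ij} = \sum_{w=1}^{W} \operatorname{softmax}\left(q_{ij}^{\top} k_{iw} + G_Q\, q_{ij}^{\top} r^{q}_{iw} + G_K\, k_{iw}^{\top} r^{k}_{iw}\right)\left(G_{V1}\, v_{iw} + G_{V2}\, r^{v}_{iw}\right) $$

where $q_{ij}$, $k_{iw}$, and $v_{iw}$ are the query, key, and value at the indicated positions, and $r^{q}$, $r^{k}$, and $r^{v}$ are the relative positional encodings for the queries, keys, and values, respectively. Attention along the height axis is defined analogously.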
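A single-head sketch of gated axial attention along the width axis is given below, based on our reading of the formulation of Valanarasu et al. [34]. The tensor layouts, the relative-offset lookup in `relative_table`, and the use of scalar gates passed as plain numbers are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def relative_table(emb, width):
    """Expand emb of shape (d, 2 * width - 1) to (d, width, width).

    Entry [:, j, w] is the embedding for relative offset w - j.
    """
    idx = np.arange(width)[None, :] - np.arange(width)[:, None] + width - 1
    return emb[:, idx]


def gated_axial_attention_width(q, k, v, r_q, r_k, r_v, g_q, g_k, g_v1, g_v2):
    """q, k: (d, H, W); v: (d_v, H, W); r_q, r_k: (d, 2W-1); r_v: (d_v, 2W-1)."""
    width = q.shape[-1]
    rq, rk, rv = (relative_table(r, width) for r in (r_q, r_k, r_v))
    content = np.einsum('dij,diw->ijw', q, k)   # q_ij^T k_iw
    q_pos = np.einsum('dij,djw->ijw', q, rq)    # q_ij^T r^q
    k_pos = np.einsum('diw,djw->ijw', k, rk)    # k_iw^T r^k
    attn = softmax(content + g_q * q_pos + g_k * k_pos, axis=-1)
    out = g_v1 * np.einsum('ijw,diw->dij', attn, v)         # gated content values
    out = out + g_v2 * np.einsum('ijw,djw->dij', attn, rv)  # gated positional values
    return out
```

Each position attends only to the W positions in its own row, so one pass costs on the order of H·W² instead of (H·W)². Applying the same function to inputs with the two spatial axes swapped gives the height-axis pass. Setting the three positional gates to zero reduces the layer to plain axial attention without positional terms.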
2.2.5. Environmental Field Prediction Branch
The environmental field prediction branch consists of a temporal transformation module and a convolutional upsampling module. The temporal transformation module is responsible for extracting and reconstructing features along the temporal dimension to match the input requirements of subsequent modules. To prevent an excessive number of model parameters, a fully connected layer is first applied to scale the feature dimensions, followed by an LSTM network to reconstruct the features along the temporal dimension.
The convolutional upsampling module adopts a progressive upsampling strategy to spatially reconstruct the features generated by the temporal transformation module. The detailed architecture is illustrated in
Figure 6. In this module, two image upsampling blocks are employed, each of which first applies convolutional layers for feature extraction, followed by bilinear interpolation to increase the spatial resolution. Bilinear interpolation is adopted because it effectively mitigates the mosaic effect, thereby preserving spatial continuity and smoothness while restoring spatial resolution. Finally, a
convolutional layer is applied to compress and integrate the feature channels.
3. Results and Discussion
The numerical experiments were conducted on a workstation running the Windows 11 24H2 operating system, with hardware configurations including an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i7-13700KF CPU (Intel Corporation, Santa Clara, CA, USA), and 32 GB of RAM. The Python 3.12.3 environment was installed via Conda 24.11.3. The deep learning models were implemented using PyTorch 2.6.0 with CUDA 12.4. The implemented loss function and optimizer were L1Loss and Adam, respectively.
3.1. Experimental Details
3.1.1. Evaluation Metrics
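The model in this study outputs a vector sequence $\{(\Delta\lambda_k, \Delta\varphi_k)\}_{k=1}^{4}$, representing the differences between the typhoon coordinates at 6 h, 12 h, 18 h, and 24 h after the last time step of the input sequence and the coordinates at the last input time step. Let the last input in the sequence be the coordinates at time $t$, $(\lambda_t, \varphi_t)$. The model’s predicted coordinates $(\hat{\lambda}_{t+k}, \hat{\varphi}_{t+k})$ are then calculated as:

$$ \hat{\lambda}_{t+k} = \lambda_t + \Delta\lambda_k, \qquad \hat{\varphi}_{t+k} = \varphi_t + \Delta\varphi_k, \qquad k = 1, 2, 3, 4 $$

where $\lambda$ and $\varphi$ denote longitude and latitude, and $k$ indexes the four lead times.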
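We use the Average Position Error (APE) as the primary evaluation metric. APE is widely used in typhoon track prediction because of its accurate and intuitive measurement of prediction precision. In practice, the spherical distance between the predicted and actual coordinates is first calculated using spherical trigonometry. APE reflects the average great-circle distance between the predicted typhoon center and the true center, calculated as:

$$ \mathrm{APE} = \frac{1}{N} \sum_{i=1}^{N} d_i \tag{10} $$

$$ d_i = 2R \arcsin\left( \sqrt{ \sin^2\frac{\hat{\varphi}_i - \varphi_i}{2} + \cos\varphi_i \cos\hat{\varphi}_i \sin^2\frac{\hat{\lambda}_i - \lambda_i}{2} } \right) \tag{11} $$

where $N$ in Equation (10) is the number of samples at the current prediction time, $d_i$ is the great-circle distance between the predicted position $(\hat{\lambda}_i, \hat{\varphi}_i)$ and the observed position $(\lambda_i, \varphi_i)$ of sample $i$, with longitude $\lambda$ and latitude $\varphi$ expressed in radians, and $R$ in Equation (11) is the radius of the Earth (usually 6371 km).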
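The metric can be sketched in a few lines. The haversine form of the great-circle distance is used here as one common choice, and the function names are hypothetical. The paper only states that spherical trigonometry is used, so the exact formula the authors applied is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing the argument just above 1.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def average_position_error(predicted, observed):
    """Mean great-circle distance (km) over paired (lat, lon) positions."""
    distances = [great_circle_km(p[0], p[1], o[0], o[1])
                 for p, o in zip(predicted, observed)]
    return sum(distances) / len(distances)
```

Computing `average_position_error` separately for each lead time gives one APE value per forecasting horizon.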
3.1.2. Experimental Setup
In this study, data from 2011 to 2018 were used as the training set, 2019 to 2020 as the validation set, and 2021 to 2024 as the test set. The initial learning rate was set to 0.0005, the batch size to 32, and the number of training epochs to 150. L1 loss was used to compute prediction errors, and the Adam optimizer was employed to update model parameters. During the model pretraining stage, all other settings remained the same, but the number of training epochs was reduced to 50.
3.2. Comparative Experiments
To verify the effectiveness of the proposed method, we compared it with several models proposed by other researchers in recent years. The baseline methods include ViT-LSTM, CNN-LSTM, AM-ConvGRU, and a Spatio-temporal model. In the numerical experiments, these models used the same loss function, batch size, learning rate, and other hyperparameters as the proposed TGAT model.
Based on this setup, we compared the APE between the predicted and true tropical cyclone coordinates at forecasting horizons of 6 h, 12 h, 18 h, and 24 h. The numerical experimental results are summarized in
Table 3.
As shown in Table 3, compared with the Spatio-temporal model, which achieves the lowest error among the baseline methods, the proposed TGAT model reduces the prediction error by 2.51%, 2.14%, 3.12%, and 3.48% at the 6 h, 12 h, 18 h, and 24 h forecasting horizons, respectively. These results indicate that the TGAT model improves tropical cyclone track prediction at every horizon, with the largest relative gains at the longer lead times of 18 h and 24 h.
In addition, we evaluated the cross-year stability of the models by comparing their average errors in each year of the test set, as reported in Table 4. The TGAT model achieves the lowest error in all years except 2024, indicating good adaptability and robustness. This suggests that the TGAT model can reliably predict the tracks of most tropical cyclones, rather than performing well only under specific conditions.
To more intuitively demonstrate the effectiveness of the TGAT model in tropical cyclone track prediction, we randomly selected four intense typhoons from the test set and generated their predicted tracks based on the 24 h forecast positions, which were then compared with the corresponding observed tracks. As shown in
Figure 7, the TGAT model is generally able to accurately capture the overall movement trends of tropical cyclones. However, larger prediction errors tend to occur when cyclones undergo abrupt changes in direction. This is likely due to the increased complexity of the atmospheric environment under such conditions, making it more difficult for the model to fully capture and represent the key influencing features.
Furthermore, to evaluate the practical performance of the proposed model, TGAT was compared with the operational global forecast system CMA-GFS of the China Meteorological Administration. According to the 2023 typhoon forecast evaluation results (Yang et al., 2025) [
35], the 24 h mean track forecast error of CMA-GFS was 77.4 km. Although the proposed model has not yet reached the accuracy level of operational numerical weather prediction systems, it demonstrates advantages in computational cost and data dependency as a deep learning-based approach. Under the same input data conditions, TGAT can achieve reasonable track prediction performance with lower computational resources, providing an efficient technical framework for tropical cyclone track forecasting. Future work will focus on incorporating additional environmental variables and further improving the model architecture to narrow the gap with operational systems while maintaining computational efficiency.
3.3. Ablation Study
To verify the effectiveness of each component in the proposed model, we conducted a series of ablation experiments. Four models with different degrees of structural completeness were compared in terms of prediction accuracy to evaluate the contribution of each module. Among them, CNN + LSTM serves as the most basic prediction model. Its overall structure is similar to that of TGAT, but it only employs a CNN module to encode atmospheric reanalysis features and does not introduce the Transformer architecture. The CNN + Transformer + LSTM model further incorporates a standard Transformer module to perform additional modeling and fusion of the features extracted by the CNN. For the complete TGAT model, two configurations were evaluated, namely with and without pretraining, in order to assess the impact of the axial Transformer structure and the dual-branch pretraining strategy on model performance. The results of the ablation experiments are presented in Table 5.
The Average Position Error (APE) decreases steadily as the model architecture becomes more complete. Specifically, adding a Transformer module to further process the CNN outputs yields a significant improvement in prediction accuracy, indicating that the Transformer extracts informative representations from convolutional features. Building on this, adopting the TGAT architecture improves accuracy further, demonstrating its stronger capability in modeling spatial dependencies and their dynamic evolution. Finally, the pretraining strategy further improves the model’s data-fitting ability.
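For readers who want to reproduce the metric, the following is a minimal sketch of how an APE value could be computed. It assumes that the position error is the great-circle (haversine) distance between predicted and observed centers and that the Earth is a sphere of radius 6371 km; the function names are illustrative and are not taken from the TGAT code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def average_position_error(pred, truth):
    """Mean great-circle error over samples.

    pred, truth: arrays of shape (n, 2) holding (lat, lon) in degrees.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    dist = great_circle_km(pred[:, 0], pred[:, 1], truth[:, 0], truth[:, 1])
    return float(dist.mean())
```

Averaging this quantity separately for each lead time (6, 12, 18, and 24 h) would give one APE value per forecast horizon.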
3.4. Model Interpretability Analysis
To further investigate the contribution of different input features, feature ablation and gradient-based saliency analyses were conducted, as shown in Figure 8 and Figure 9. When computing the saliency maps in Figure 9, the results for each feature were averaged over three pressure levels. In addition, a smoothing operation was applied to reduce noise and enhance interpretability.
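The post-processing described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the gradients are assumed to be available as an array of shape (features, levels, H, W), their absolute value is taken as the saliency, and the unspecified smoothing operation is replaced by a simple 3 × 3 mean filter.

```python
import numpy as np


def box_smooth(field, k=3):
    """Smooth a 2D field with a k x k mean filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(field, pad, mode="edge")
    out = np.zeros_like(field, dtype=float)
    h, w = field.shape
    # Sum the k*k shifted copies of the field, then divide to get the mean.
    for di in range(k):
        for dj in range(k):
            out += padded[di:di + h, dj:dj + w]
    return out / (k * k)


def saliency_maps(grads, k=3):
    """Turn raw input gradients into one smoothed map per feature.

    grads: array of shape (n_features, n_levels, H, W).
    Returns an array of shape (n_features, H, W).
    """
    level_mean = np.abs(grads).mean(axis=1)  # average over pressure levels
    return np.stack([box_smooth(m, k) for m in level_mean])
```

A Gaussian filter would serve the same purpose; the mean filter is used here only because it needs no extra dependency.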
As illustrated in Figure 8, removing variables such as W, V, and VP leads to a significant increase in prediction error, with the maximum increase reaching 36%, indicating that these variables play a dominant role in model performance. In contrast, GH, RH, and Q have relatively smaller impacts on the prediction results.
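A feature ablation of this kind can be sketched as below. How the variables were removed is not specified here, so the sketch assumes that a removed channel is replaced by zeros (the mean value after standardization); `predict` and `error_fn` are hypothetical callables standing in for the trained model and the error metric.

```python
import numpy as np


def feature_ablation(predict, error_fn, x, y, feature_names):
    """Report the percentage increase in error when each feature is removed.

    x: inputs of shape (n_samples, n_features, ...).
    y: targets accepted by error_fn.
    """
    baseline = error_fn(predict(x), y)
    increases = {}
    for i, name in enumerate(feature_names):
        x_ablated = x.copy()
        x_ablated[:, i] = 0.0  # assumed removal: zero out one feature channel
        ablated = error_fn(predict(x_ablated), y)
        increases[name] = 100.0 * (ablated - baseline) / baseline
    return increases
```

Under this definition, a value of 36 for a variable means that the error without it is 1.36 times the baseline error.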
The spatial saliency results in Figure 9 further indicate that the model tends to focus on localized key regions while maintaining a moderate level of response across most areas. Combined with Figure 8, it can be observed that features with higher importance generally exhibit more concentrated and prominent high-response regions in Figure 9. In contrast, features with lower importance, such as GH, RH, and Q, tend to show more dispersed regions with moderate responses. This suggests that the former contain more concentrated and informative patterns that are easier for the model to capture, whereas the latter exhibit more complex or less distinct spatial structures that are harder to learn effectively.
It is also noteworthy that, although most high-saliency regions are concentrated around the typhoon center, relatively strong responses are also observed in the upper-right regions for features such as T, Q, STR, and VP. Considering the geographical characteristics of the Western North Pacific, this may indicate that the model pays attention to the variations of these features when the typhoon approaches coastal or land areas.
Overall, the combination of feature importance analysis and spatial saliency analysis demonstrates that the proposed model is capable of not only identifying key input variables but also learning their spatial distributions, thereby enhancing model interpretability. However, for certain features, the model still shows limitations in representation capability, suggesting that its ability to capture complex patterns could be further improved.
4. Conclusions
This study investigates tropical cyclone track prediction using deep learning-based approaches. Specifically, to enable long-term tropical cyclone track forecasting, we constructed a hybrid dataset with high temporal and spatial coverage spanning 2011 to 2024 by integrating CMA and CFSv2 datasets. A sliding-window strategy was employed to generate the model inputs and outputs. In addition, to improve data quality and enhance model fitting capability, normalization was applied separately to different data sources.
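The sample construction summarized above can be illustrated with a short sketch. The window lengths `n_in` and `n_out` are hypothetical parameters, and z-score standardization is assumed as the normalization; in this sketch each data source would be passed through `zscore` separately before the windows are built.

```python
import numpy as np


def zscore(a, axis=0):
    """Standardize along `axis`; the small constant guards against zero variance."""
    mean = a.mean(axis=axis, keepdims=True)
    std = a.std(axis=axis, keepdims=True)
    return (a - mean) / (std + 1e-8)


def sliding_windows(seq, n_in, n_out):
    """Split a (T, ...) sequence into input/target pairs.

    Each input holds n_in consecutive steps and each target the next n_out steps.
    """
    xs, ys = [], []
    for start in range(len(seq) - n_in - n_out + 1):
        xs.append(seq[start:start + n_in])
        ys.append(seq[start + n_in:start + n_in + n_out])
    return np.array(xs), np.array(ys)
```

For a track with 10 time steps, `sliding_windows(track, 4, 2)` yields 5 samples, each pairing 4 past steps with the 2 steps that follow.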
To address the challenge of insufficient spatial feature extraction in long-term forecasting, we proposed the Typhoon-Gated Axial Transformer (TGAT) model. The proposed model combines the efficiency of convolutional neural networks in local feature extraction with the ability of Transformers to model global dependencies. Furthermore, a gated axial attention mechanism was introduced to effectively control parameter redundancy in Transformer-based image modeling, thereby improving computational efficiency and generalization performance. A pretraining strategy was also incorporated to enhance the model’s capability to perceive and model future environmental changes.
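The efficiency argument for axial attention can be made concrete with a small counting sketch. It compares the number of query-key scores per head for full self-attention over an H × W feature map with the number for axial attention, which attends along rows and then along columns. The function is illustrative only and does not model the gating.

```python
def attention_score_counts(height, width):
    """Number of query-key scores per head for full vs. axial self-attention."""
    tokens = height * width
    full = tokens * tokens             # every position attends to every position
    axial = tokens * (height + width)  # each position attends along its row, then its column
    return full, axial
```

For a 32 × 32 feature map this gives 1,048,576 scores for full attention against 65,536 for axial attention, a 16-fold reduction.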
Results from the numerical experiments demonstrate that TGAT achieves a lower Average Position Error (APE) than all comparison models in 6–24 h track prediction tasks and exhibits superior robustness in cross-year stability evaluations. In addition, ablation experiments confirm the contributions of individual model components, indicating that both the Transformer module and the gated axial attention mechanism play critical roles in improving prediction accuracy.
Although TGAT achieves strong performance on most test samples, we observe that prediction errors increase when tropical cyclones undergo abrupt directional changes. This limitation may arise from the difficulty of learning complex atmospheric features under special environmental conditions from limited samples. Moreover, while the gated axial attention mechanism effectively reduces the computational cost of Transformer-based image modeling, the overall computational overhead of the model remains higher than that of other comparable methods. Future work will focus on improving the model’s fitting capability for rare and extreme cases, as well as exploring more efficient architectural designs to further reduce computational complexity and enhance operational efficiency.