1. Introduction
The accurate forecasting of global fish catch is a critical challenge for ensuring food security, maintaining marine ecological balance, and informing economic policies on fisheries [1,2,3]. With the increasing threats of overfishing and climate change, sustainable fisheries management has become an urgent priority [4]. Effective resource allocation, policy planning, and catch quota adjustments rely on robust forecasting models that can predict fishery trends over various time scales, including short-term (e.g., three years), medium-term (e.g., five years), and longer-term (e.g., seven years) horizons [5]. However, achieving high-precision predictions remains complex due to the dynamic nature of marine ecosystems, market fluctuations, policy interventions, and the significant influence of external environmental drivers like climate variability [6].
Traditional time-series forecasting methods, such as AutoRegressive Integrated Moving Average (ARIMA) and Seasonal ARIMA, have been used for fish catch prediction [7]. While useful for linear trends, they struggle to integrate multiple external factors (like climate data) and capture complex non-linear dependencies [8]. Moreover, they often require extensive manual feature engineering and assumptions, limiting their adaptability [9].
Deep learning models, particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have shown promise [10,11]. However, their effectiveness can be limited in multi-horizon forecasting: they may struggle with very long sequences, and often lack inherent interpretability, hindering expert insights into the domain [12]. Other advanced techniques like Temporal Convolutional Networks (TCNs) [13] and gradient boosting machines like XGBoost offer alternative powerful approaches but may face challenges in seamlessly integrating diverse input types (static, dynamic observed, dynamic known) or providing built-in temporal interpretation mechanisms suited for complex time-series interactions.
Recently, the Temporal Fusion Transformer (TFT) has emerged as a state-of-the-art deep learning framework specifically designed for multi-step forecasting with high interpretability [14,15,16]. TFT combines recurrent layers with self-attention, enabling it to capture both local and long-range temporal dependencies. Crucially, it incorporates mechanisms to handle various input types (static metadata, past time series, known future inputs) and provides built-in variable selection and attention-based interpretability [17,18]. Despite this potential, TFT has seen limited systematic application and rigorous evaluation in global fish catch forecasting, especially with key climate drivers integrated and performance assessed over extended forecast horizons.
This study aims to bridge this gap by developing and evaluating a TFT-based forecasting model using historical fish catch data (1950–2020) from the Sea Around Us dataset augmented with climate indicators (annual ONI and global SST anomalies). Specifically, we seek to achieve the following: First, construct a TFT model capable of predicting global fish catch trends over three-year, five-year, and seven-year horizons, leveraging historical catch, static attributes, and climate data. Next, comprehensively compare TFT’s performance against a wide range of benchmarks, including ARIMA, Multi-layer Perceptron (MLP), LSTM, TCN, and XGBoost, evaluating predictive accuracy across all horizons. Then, utilize TFT’s built-in interpretability tools to analyze key factors influencing fish catch fluctuations, including the relative importance of historical catch versus climate signals.
The primary contributions of this work are as follows:
We propose and validate a TFT-based approach for short-to-long-term (3, 5, and 7 years) global fish catch forecasting, demonstrating its ability to effectively integrate historical trends, static identifiers, and crucial climate variability data while maintaining explainability.
We fine-tune the TFT architecture for the specific characteristics of the Sea Around Us dataset enriched with climate information.
We systematically benchmark TFT against five diverse established models (ARIMA, MLP, LSTM, TCN, XGBoost), showcasing its advantages, particularly in its robustness for longer forecasting horizons and its capacity to leverage external climate drivers.
By leveraging state-of-the-art deep learning techniques and incorporating essential environmental context, this study provides an interpretable and high-accuracy forecasting framework, offering valuable insights for sustainable fisheries management, especially for medium-to-longer-term planning horizons.
3. Data and Task Definition
3.1. Data Source
The primary dataset used in this study is obtained from the Sea Around Us platform [38], providing reconstructed fish catch estimates from 1950 to 2020. It covers various regions, exclusive economic zones, and species groups. For this study, we have selectively chosen data from the top 10 fishing nations based on their total reported catch volume over the past two decades (Table 2). Additionally, we focus on 10 major commercially significant species groups, ensuring the dataset remains representative yet computationally manageable. An example subset of the data for Norway is shown in Figure 1.
The selected dataset consists of approximately 3.2 million records, with annual fish catch values reported in metric tons. Each record includes attributes like country of origin, species group classification, fishing gear type, and commercial sector classification, as shown in Table 3. To enhance the predictive power by incorporating crucial environmental context, we augmented this dataset with two key annual climate indicators:
Annual Oceanic Niño Index (ONI): The ONI, a 3-month running mean of SST anomalies in the Niño 3.4 region (5° N–5° S, 120°–170° W), is a primary indicator for monitoring El Niño and La Niña events. Using the monthly ONI data provided by the NOAA Climate Prediction Center (CPC) [41], we calculated the annual average ONI for each year from 1950 to 2020.
Annual global mean sea surface temperature (SST) anomaly: This is calculated from the monthly NOAA Extended Reconstructed Sea Surface Temperature (ERSST) v5 dataset [42]. This dataset provides global gridded SST data back to 1854. We computed the global average SST anomaly for each month relative to a 1971–2000 baseline and then averaged these monthly anomalies to obtain an annual value for 1950–2020.
These climate indicators were integrated into the dataset, aligning them by year with the corresponding fish catch records.
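The annual averaging and year-wise alignment described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the column names (`year`, `month`, `oni`, `sst`, `catch_t`) and helper functions are hypothetical, and the use of a per-calendar-month climatology for the 1971–2000 baseline is an assumption.

```python
import pandas as pd

def annual_mean(monthly: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Average monthly values within each calendar year."""
    return monthly.groupby("year", as_index=False)[value_col].mean()

def monthly_anomaly(monthly: pd.DataFrame, value_col: str,
                    base_start: int = 1971, base_end: int = 2000) -> pd.DataFrame:
    """Subtract the baseline-period mean of each calendar month (assumed climatology)."""
    in_base = monthly["year"].between(base_start, base_end)
    climatology = monthly.loc[in_base].groupby("month")[value_col].mean()
    out = monthly.copy()
    out[value_col + "_anom"] = out[value_col] - out["month"].map(climatology)
    return out

def attach_climate(catch: pd.DataFrame, oni_monthly: pd.DataFrame,
                   sst_monthly: pd.DataFrame) -> pd.DataFrame:
    """Join annual ONI and annual SST anomaly onto catch records by year."""
    oni_annual = annual_mean(oni_monthly, "oni")
    sst_annual = annual_mean(monthly_anomaly(sst_monthly, "sst"), "sst_anom")
    climate = oni_annual.merge(sst_annual, on="year", how="outer")
    return catch.merge(climate, on="year", how="left")
```

Because the climate indicators are global, every catch record of a given year receives the same two values after the join.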
Preprocessing steps were implemented, including aggregation, normalization of numerical features (including catch volume and climate indicators, using standardization based on training set statistics), and categorical encoding of static attributes. Missing values in historical records were handled using interpolation or zero-padding. The dataset was structured globally and for specific species groups.
3.2. Prediction Task
The objective is to develop a forecasting model capable of predicting global and species-specific fish catch volumes over multiple time horizons using historical data including catch history and climate indicators. The prediction task is framed as a multi-horizon forecasting problem, where the model utilizes past fishery data to estimate cumulative catch over three-year, five-year, and seven-year periods.
The target variable for prediction is the cumulative fish catch over predefined forecasting horizons. Given historical data up to year $t$, the model is required to predict the total catch for the subsequent periods: $t+1$ to $t+3$ (3-year forecast), $t+1$ to $t+5$ (5-year forecast), and $t+1$ to $t+7$ (7-year forecast). For example, when trained on data up to 2010, the model generates predictions for the cumulative catch from 2011 to 2013, 2011 to 2015, and 2011 to 2017. This approach captures short-term fluctuations and medium-term trends, and allows for longer-term strategic assessment.
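As a rough sketch of how such cumulative targets could be built from an annual catch series, the function below sums the catch over the years following each forecast origin. The function name and the choice to mark incomplete windows with NaN are illustrative assumptions.

```python
import numpy as np

def cumulative_targets(catch, horizons=(3, 5, 7)):
    """For each origin index i, sum catch[i+1 .. i+h]; NaN if the window runs past the data."""
    catch = np.asarray(catch, dtype=float)
    n = len(catch)
    # Prefix sums: csum[k] is the sum of the first k values.
    csum = np.concatenate([[0.0], np.cumsum(catch)])
    targets = {}
    for h in horizons:
        out = np.full(n, np.nan)
        valid = n - h  # number of origins with a complete h-year window
        if valid > 0:
            idx = np.arange(valid)
            out[idx] = csum[idx + 1 + h] - csum[idx + 1]
        targets[h] = out
    return targets
```

With a series indexed by year, the entry at the position of 2010 in `targets[3]` would be the total catch over 2011 to 2013, matching the example in the text.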
The dataset is partitioned into three subsets to facilitate training, validation, and testing:
Training set (1950–2010): Used to learn temporal patterns and model parameters.
Validation set (2011–2015): Used for hyperparameter tuning and early stopping.
Test set (2016–2020): Used for final evaluation of forecasting performance on unseen data.
For evaluating the 7-year forecast performance on the test set (ending in 2020), predictions starting in 2016, 2017, etc., are assessed based on the available actual data up to 2020. Specifically, the error for a 7-year forecast initiated from year $t$ (where $t + 7 > 2020$) is calculated over the available future steps $t+1, \ldots, 2020$.
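One way to read this truncation rule is sketched below, assuming the forecast origin is the last observed year and that the window is simply clipped at the final year with data. Both the helper name and this reading are assumptions for illustration.

```python
def available_steps(origin_year, horizon=7, last_observed_year=2020):
    """Years of the forecast window that still have observed data to score against."""
    first = origin_year + 1
    last = min(origin_year + horizon, last_observed_year)
    return list(range(first, last + 1))
```

For an origin of 2015 the full window would be 2016 to 2022, but only 2016 to 2020 can be scored; for an origin of 2010 the complete window 2011 to 2017 is available.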
3.3. Feature Engineering
To ensure the model captures essential temporal dependencies and explanatory factors, the dataset is structured into feature categories suitable for TFT:
Static covariates: Categorical attributes that remain constant over time for each entity (time series), including the country identifier and species group classification. These are processed using embedding layers to learn meaningful representations.
Past observed inputs: Time-dependent features observed up to the current time step $t$. This includes the historical fish catch volumes (target variable history) and the historical climate indicators (annual ONI, annual global SST anomaly) spanning a rolling window of 20 years (i.e., from $t-19$ to $t$).
Future known inputs: Time-related attributes known for the entire forecast horizon ($t+1$ to $t+H$, where $H$ is 3, 5, or 7). In this study, this primarily includes the relative time index (e.g., 1 for the first forecast step, 2 for the second, etc.) and potentially the calendar year embedding if deemed known.
This structuring allows TFT to leverage different types of information appropriately during the encoding and decoding phases.
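The three input groups can be pictured with a small container per training sample. The sketch below is a simplified illustration under assumed names (`Sample`, `make_sample`) and an assumed array layout (one row per year, columns for catch, ONI, and SST anomaly); it is not tied to any particular TFT library.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Sample:
    static: dict               # e.g. {"country": "NOR", "species_group": "herring"}
    past_observed: np.ndarray  # shape (lookback, n_features)
    future_known: np.ndarray   # shape (horizon,), relative time index 1..H

def make_sample(series, origin, static, lookback=20, horizon=7):
    """Slice the look-back window ending at `origin` and build the known future index."""
    start = origin - lookback + 1
    if start < 0:
        raise ValueError("not enough history for the requested look-back window")
    past = np.asarray(series)[start:origin + 1]
    future = np.arange(1, horizon + 1)
    return Sample(static=static, past_observed=past, future_known=future)
```

Only the relative time index is placed in the future block here, since future catch and climate values are not known at forecast time.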
3.4. Evaluation Metrics
To systematically assess the performance of the forecasting models, multiple standard error metrics are employed:
Root mean squared error (RMSE): Measures the square root of the average squared differences between predicted ($\hat{y}_i$) and actual ($y_i$) values. It penalizes larger errors more heavily.
Mean absolute error (MAE): Measures the average absolute differences between predicted and actual values, providing a linear score of the error magnitude.
Mean absolute percentage error (MAPE): Measures the average absolute percentage difference relative to the actual value. It is scale-independent but sensitive to zero or near-zero actual values (which are avoided here due to log-transformation or handling).
Lower values for all metrics indicate better predictive accuracy. These metrics are calculated separately for the 3-year, 5-year, and 7-year cumulative forecast horizons on the test set.
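For reference, the standard definitions of these three metrics are given below, writing $y_i$ for the actual value, $\hat{y}_i$ for the predicted value, and $N$ for the number of evaluated forecasts (symbols chosen here for illustration):

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}, \\
\mathrm{MAE}  &= \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \\
\mathrm{MAPE} &= \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{\hat{y}_i - y_i}{y_i}\right|.
\end{align}
```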
5. Experiments
5.1. Data Preparation and Preprocessing
The dataset, sourced from Sea Around Us and augmented with annual ONI and global SST anomaly data, covers global fishery catch records from 1950 to 2020. Preprocessing involved handling missing values (linear interpolation or zero-padding), aggregating data where necessary, and feature scaling. Numerical features, including the target variable (fish catch volume) and the climate indicators (ONI, SST anomaly), were log-transformed to reduce skewness and then standardized by subtracting the mean and dividing by the standard deviation, with parameters derived solely from the training set (1950–2010). Categorical features (country, species) were encoded. The data were structured for input into the time-series models, with temporal partitioning into training (1950–2010), validation (2011–2015), and test (2016–2020) sets.
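A minimal sketch of the scaling step is shown below. The exact skewness-reducing transform is not stated in the text, so `log1p` is an assumption, and the sketch applies it to non-negative catch volumes only, since `log1p` is undefined for values at or below −1. The key point illustrated is that the mean and standard deviation come from the training years alone.

```python
import numpy as np

def fit_scaler(train_values):
    """Estimate mean and standard deviation of log1p-transformed training values."""
    logged = np.log1p(np.asarray(train_values, dtype=float))
    return logged.mean(), logged.std()

def transform(values, mean, std):
    """Apply log1p, then standardize with training-set statistics."""
    return (np.log1p(np.asarray(values, dtype=float)) - mean) / std

def inverse_transform(z, mean, std):
    """Map standardized values back to the original catch scale."""
    return np.expm1(np.asarray(z, dtype=float) * std + mean)
```

Validation and test values would be passed through `transform` with the training-set `mean` and `std`, so that no information from 2011 onward leaks into the scaling.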
5.2. Baseline Models for Comparison
To rigorously evaluate the TFT model’s effectiveness, five diverse baseline methods were implemented and evaluated on the same dataset:
ARIMA (Autoregressive Integrated Moving Average): Where applicable, exogenous climate variables (ONI, SST) were included, effectively creating an ARIMAX model. Models were fitted independently for each country–species time series.
MLP (Multi-layer Perceptron): A feedforward neural network with two hidden layers (ReLU activation) using a flattened sequence of the past 20 years of catch and climate data as input features to directly predict the cumulative catch for each horizon (3, 5, 7 years).
LSTM (Long Short-Term Memory) network: A standard two-layer LSTM network processing sequences of the past 20 years of catch and climate data to forecast future steps, from which cumulative values were derived.
TCN (Temporal Convolutional Network): Implemented using stacked dilated causal convolutional layers to capture temporal dependencies from the 20-year input sequences of catch and climate data.
XGBoost: An optimized gradient boosting decision tree model. Input features included lagged values of catch and climate data, rolling window statistics, and encoded static features.
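The lagged and rolling-window inputs mentioned for the tree-based baseline could be derived along the following lines. The column names, lag set, and window length are hypothetical; the text does not specify them.

```python
import pandas as pd

def add_lag_features(df, value_cols=("catch", "oni", "sst_anom"),
                     lags=(1, 2, 3), window=3):
    """Add per-series lagged values and a rolling mean of past catch."""
    keys = ["country", "species"]
    out = df.sort_values(keys + ["year"]).copy()
    for col in value_cols:
        for k in lags:
            # Shift within each country-species series so lags never cross series.
            out[f"{col}_lag{k}"] = out.groupby(keys)[col].shift(k)
    # Shift by one year first so the rolling mean uses only past values.
    out[f"catch_roll{window}_mean"] = out.groupby(keys)["catch"].transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    return out
```

Rows at the start of each series contain missing lag values, which would need to be dropped or imputed before fitting.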
The primary model evaluated was the Temporal Fusion Transformer (TFT) [14], configured as described in Section 4 to leverage static, past observed (catch, ONI, SST), and future known (time index) inputs.
5.3. Hyperparameter Configuration and Training Strategy
The model hyperparameters were tuned based on performance (minimizing RMSE on the 5-year forecast) on the validation set (2011–2015). Key parameters for TFT included the following: hidden state size of 64; two attention heads; dropout rate of 0.2; and batch size of 128. For the LSTM and TCN, hidden layer sizes or filter counts and layer counts were tuned. For the MLP, the hidden layer dimensions were optimized. For XGBoost, parameters like tree depth, number of estimators, and learning rate were tuned using grid search or randomized search. All neural network models (TFT, LSTM, TCN, MLP) were trained using the Adam optimizer with an initial learning rate of 0.001 and employed the quantile loss (quantiles: 0.1, 0.5, 0.9). Training proceeded for a maximum of 100 epochs, utilizing early stopping with a patience of 10 epochs based on validation loss to prevent overfitting. Each model was trained and evaluated three times using different random seeds to ensure robustness of the results; the average performance across these runs is reported.
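The quantile (pinball) loss used for the neural models can be written compactly as below. This NumPy version is a sketch for illustration; averaging over both samples and quantiles is an assumed normalization, and actual training code would use the deep learning framework's tensors instead.

```python
import numpy as np

def quantile_loss(y_true, y_pred, quantiles=(0.1, 0.5, 0.9)):
    """Pinball loss averaged over samples and quantiles.

    y_true: shape (n,); y_pred: shape (n, len(quantiles)), one column per quantile.
    """
    y_true = np.asarray(y_true, dtype=float)[:, None]
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantiles, dtype=float)[None, :]
    err = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))
```

The 0.5 quantile column corresponds to the median forecast, which is the natural point estimate to score with RMSE, MAE, and MAPE.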
6. Results and Analysis
This section presents an empirical evaluation of the Temporal Fusion Transformer model against the selected baselines across different forecasting horizons (3, 5, and 7 years). All models were trained and evaluated using the dataset enriched with historical catch data and the annual ONI and global SST anomaly climate indicators.
6.1. Overall Prediction Accuracy
The comparative forecasting performance of all evaluated models on the test set (2016–2020) is detailed in Table 4. The results across the RMSE, MAE, and MAPE metrics reveal that while performance varies by horizon and metric, the Temporal Fusion Transformer consistently demonstrates robust and often superior accuracy, particularly for the extended 7-year forecast.
For the 7-year horizon, TFT achieves the lowest error across all metrics, recording an RMSE of 2.18, MAE of 1.71, and MAPE of 13.7%. This marks a clear advantage over the next best models at this horizon, LSTM (MAPE 15.1%) and TCN (MAPE 15.4%). The relative MAPE improvement of TFT over LSTM at 7 years is approximately 9.3%. This superior long-term performance underscores TFT’s capability in capturing enduring temporal dependencies potentially modulated by multi-year climate cycles present in the input data.
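The relative improvement quoted above follows directly from the two 7-year MAPE values:

```latex
\[
\frac{\mathrm{MAPE}_{\mathrm{LSTM}} - \mathrm{MAPE}_{\mathrm{TFT}}}{\mathrm{MAPE}_{\mathrm{LSTM}}}
= \frac{15.1 - 13.7}{15.1}
= \frac{1.4}{15.1}
\approx 0.093 = 9.3\%.
\]
```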
At the 5-year horizon, TFT also leads in overall performance, achieving the lowest RMSE (1.94), MAE (1.49), and MAPE (12.0%). For the 3-year horizon, the competition is closer; while LSTM achieves a slightly lower MAPE (9.6% vs. TFT’s 9.7%), TFT secures the best RMSE (1.60) and MAE (1.22), indicating high accuracy even in the shorter term.
When compared against the strong non-recurrent baselines, TFT maintains its lead. Its 7-year MAPE of 13.7% is considerably better than XGBoost’s 16.5% and TCN’s 15.4%. This suggests that TFT’s architecture, specifically designed for multi-horizon forecasting with heterogeneous inputs and attention mechanisms, leverages the combined information from catch history, climate signals, and static attributes more effectively than these alternative approaches for this specific task.
As expected, the traditional ARIMA model and the simpler MLP architecture exhibit significantly higher errors, especially as the forecast horizon lengthens. This reflects their inherent limitations in modeling the complex non-linear dynamics and diverse data types characterizing global fisheries.
Figure 4 provides a visualization of the spatial distribution of prediction errors derived from the TFT model’s 5-year forecasts. It highlights geographical variations in predictability, with larger errors often coinciding with regions known for highly variable fish stocks or those particularly sensitive to large-scale climate events.
6.2. Prediction Performance at Species Level
Table 5 illustrates the TFT model’s forecasting accuracy for ten commercially important fish species across the 3-, 5-, and 7-year horizons. The results highlight substantial heterogeneity in predictability among different species groups. This variation likely stems from a combination of factors, including species-specific life history traits, differing sensitivities to environmental fluctuations captured by the ONI and SST inputs, varying fishing pressures, and data quality differences.
For instance, Atlantic herring maintains relatively high predictability, exhibiting the lowest 7-year MAPE among the group at 10.8%. This suggests its population dynamics within the studied period might be more stable or better explained by the model’s input features. Conversely, cephalopods consistently show higher prediction errors, reaching a 7-year MAPE of 14.2%. This aligns with the known rapid turnover and high sensitivity of many cephalopod populations to oceanographic conditions, making them inherently more challenging to forecast accurately, even with climate data included. Other species like anchovy and cod also show relatively higher errors over longer horizons, possibly linked to recruitment variability and strong ENSO influences (for anchovy) or complex stock dynamics and historical fishing impacts (for cod).
6.3. Model Interpretation and Feature Importance
Understanding the key drivers behind the forecasts is essential. TFT’s variable selection network (VSN) provides quantitative insights by assigning importance scores to each input feature, averaged over time steps and forecast horizons, as shown in Figure 5.
Table 6 displays the feature importance ranking from the trained TFT model. As anticipated, the most recent historical catch volume (past 1-year fish catch) remains the most influential feature, with an importance score of 0.31, reflecting the strong autoregressive nature of fisheries data. However, the integrated climate indicators demonstrate substantial predictive power. The annual ONI Index ranks as the third most important feature with a score of 0.15, confirming the substantial influence of ENSO patterns on global fisheries, as captured by the model. The annual global SST anomaly also contributes notably, ranking fifth with a score of 0.10.
Together, these two climate variables account for 25% of the total feature importance, clearly demonstrating their value in complementing the information derived purely from historical catch trends (1-year and 3-year average catch combine for 46% importance). Temporal context (year embedding, 12%) and static attributes differentiating entities (country ID, 9%; species group ID, 8%) also contribute meaningfully. This analysis quantitatively validates the benefit of incorporating environmental context into the forecasting model, allowing it to capture dynamics beyond simple historical extrapolation, as shown in Figure 6.
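The reported shares can be checked for internal consistency. The score of 0.15 for the 3-year average catch is not stated directly; it is inferred here from the combined 46% minus the 0.31 of the 1-year catch.

```latex
\begin{align}
\text{climate:} \quad & 0.15 + 0.10 = 0.25, \\
\text{catch history:} \quad & 0.31 + 0.15 = 0.46, \\
\text{total:} \quad & 0.46 + 0.25 + 0.12 + 0.09 + 0.08 = 1.00.
\end{align}
```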
6.4. Ablation Study
To dissect the contribution of different architectural components within the TFT model, an ablation study was performed. Key modules were systematically removed, and the model was retrained and evaluated. The results, shown in Table 7, quantify the impact of each component on the forecasting accuracy across the three horizons.
The most substantial performance degradation occurs when temporal covariates are removed (7-year RMSE increases by 0.32 from 2.18 to 2.50). This component group includes the time index and, crucially, the ONI and SST climate variables, highlighting the vital role of incorporating both sequential context and external environmental drivers for effective prediction. The exclusion of static embeddings (country, species identifiers) also significantly impacts performance (7-year RMSE rises to 2.38), confirming that entity-specific information provides essential context for the model.
Omitting the multi-head attention mechanism or the variable selection network leads to smaller, yet consistent, performance degradation. For example, removing attention increases the 7-year RMSE to 2.27. This indicates that while the core temporal processing (LSTM layers) and context integration (static/temporal covariates) are paramount, the attention mechanism effectively helps the model focus on relevant long-range historical patterns, and variable selection aids in filtering input features, both contributing positively to the final accuracy.
6.5. Sensitivity Analysis
Two sensitivity analyses were conducted on the final TFT model configuration to examine its robustness concerning the temporal granularity of the input data and the length of the historical input sequence (look-back window).
Table 8 shows the impact of aggregating the input data annually, quarterly, or monthly. The results show that quarterly aggregation yields slightly superior performance compared to annual aggregation across most metrics and horizons, achieving a 7-year RMSE of 2.14 versus 2.18 for annual data. This suggests that capturing some sub-annual dynamics offers a small benefit. However, using monthly data significantly degrades performance (7-year RMSE 2.42), likely because the increased noise outweighs any potential signal gain at this aggregation level for multi-year forecasting.
Table 9 assesses the impact of varying the input sequence length from 5 to 25 years. The analysis confirms that employing a 20-year look-back window provides the optimal balance for prediction accuracy, yielding the lowest errors across the board (7-year RMSE: 2.18). Using substantially shorter histories (5 or 10 years) markedly reduces performance, demonstrating the importance of capturing long-term dependencies and historical context, which includes multi-year climate patterns. Extending the history to 25 years provides no further significant improvement, indicating diminishing returns from incorporating data beyond two decades for predicting up to 7 years ahead in this context.
7. Conclusions
This study presented a comprehensive framework and empirical evaluation of the Temporal Fusion Transformer for multi-horizon global fish catch forecasting, demonstrating its effectiveness when integrating historical catch data with key climate indicators. By leveraging the extensive Sea Around Us database augmented with environmental context, the TFT model successfully predicted fishery trends over three-year, five-year, and notably, extended seven-year horizons.
Our experiments showed that TFT consistently delivered highly accurate predictions. It outperformed a diverse set of benchmark models including ARIMA, MLP, LSTM, TCN, and XGBoost, particularly on the challenging 7-year forecast horizon, where it achieved the lowest error (MAPE of 13.7%). This result highlights TFT’s superior capability in handling long-range dependencies and effectively fusing heterogeneous information sources compared to other architectures.
The model’s built-in interpretability provided valuable insights, quantifying the significant predictive power of the incorporated climate signals. The ENSO indicator and global SST anomaly collectively accounted for 25% of the feature importance, confirming their crucial role alongside historical catch trends in driving forecast outcomes. Ablation studies further validated the importance of both temporal context and static entity information, while sensitivity analyses indicated that quarterly aggregation performed slightly better than the annual aggregation used in the main experiments and that a 20-year look-back window was near-optimal for this application.
Despite the promising results, limitations remain. The use of global climate indices, while informative, might mask important regional environmental variations impacting specific fisheries. The study focused only on ONI and SST, neglecting other potential climate modes or ecological drivers (e.g., primary productivity, ocean currents). Furthermore, the evaluation of the 7-year forecast was necessarily limited by the available data endpoint (2020).
Future research should aim to incorporate higher-resolution regional climate and oceanographic data to potentially improve predictions for specific stocks. Exploring additional environmental drivers and socio-economic factors could further enhance model realism. Applying this framework to specific fishery management scenarios, such as evaluating adaptive quota setting under different climate projections, would demonstrate its practical utility. Overall, this research underscores the potential of advanced, interpretable deep learning models like TFT, when appropriately contextualized with environmental data, to significantly improve our ability to forecast complex ecological systems like global fisheries over meaningful short-to-long-term planning horizons relevant for sustainable management.