1. Introduction
The foreign exchange (FOREX) market, a global financial hub, enables continuous 24/5 trade of currencies across participants and time zones, serving as a vital conduit for international trade, investment, and speculation.
The FOREX ecosystem involves a multitude of actors, including multinational corporations engaged in cross-border trade, governments making strategic interventions, central banks shaping monetary policy, financial institutions providing liquidity and market-making services, and individual traders operating online. Despite its complexity and volatility, the FOREX market attracts traders and investors through features such as short selling and leverage. Short selling is the act of selling a currency that the agent does not currently own, with the commitment to buy it back in the near future.
Exchange rates—reflecting the value of one currency relative to another—are classified by regime (fixed or floating) and by type (nominal or real). Fixed exchange rates, dictated by governments or central banks, remain constant, while floating rates are determined by supply and demand. Nominal exchange rates represent the current market value of a currency pair, while real exchange rates account for inflation. The flexibility of floating rates allows currencies to adapt to changing economic conditions, facilitating trade and investment flows, promoting price stability, and maintaining external balance. The exchange rates of major reserve currencies, such as the US dollar (USD), the euro (EUR), the Japanese yen (JPY), and the British pound sterling (GBP), hold significant importance in the global economic landscape due to their crucial role in international trade and financial transactions.
Accurate exchange rate forecasts are essential for market participants, namely, traders, investors, businesses, and policymakers. These predictions inform decisions about currency trades, asset allocation, and risk management, ultimately impacting portfolio performance. Businesses and policymakers rely on these models to plan and execute international transactions, manage foreign currency exposure, and mitigate risks associated with currency fluctuations. Additionally, they play a role in formulating effective monetary and fiscal policies aimed at achieving macroeconomic stability, managing inflation, and fostering sustainable economic growth.
Traditionally, exchange rate forecasting has relied on fundamental and technical analyses. Fundamental analysis involves studying economic indicators like interest rates, inflation, GDP growth, trade balances, and even geopolitical developments to understand the underlying factors driving currency movements. Technical analysis, on the other hand, focuses on historical price data, chart patterns, and technical indicators to identify trends and predict future price movements. It is important to note that expert opinions, intuition, and qualitative judgments are sometimes used. However, their success varies considerably.
In recent years, advanced computational techniques such as machine learning (ML) and artificial intelligence (AI) have transformed exchange rate forecasting. These approaches leverage big data analytics, deep learning, and neural networks to process vast amounts of financial data, identify complex patterns, and generate precise predictions. Deep learning models have proven particularly effective, demonstrating superior predictive capabilities compared to traditional methods [1]. They are set apart by their ability to capture nonlinear relationships, temporal dependencies, and high-dimensional features inherent in financial time series data. Classic architectures like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have limitations. For instance, CNN pooling layers disregard crucial part–whole correlations and lose valuable data, while RNNs are prone to gradient vanishing or exploding issues during backpropagation.
Addressing these challenges, Vaswani et al. (2017) [2] introduced the Transformer, a novel deep learning model. This model, originally excelling in natural language processing (NLP) tasks, replaces traditional CNN and RNN frameworks with an attention mechanism. Unlike the sequential structure of RNNs and LSTMs, the Transformer’s self-attention mechanism can be trained in parallel and requires less complexity to gather global information.
While Transformer architectures have revolutionized NLP tasks, such as machine translation and language modeling, financial institutions are exploring their ability to tackle the complexities of financial time series forecasting. This integration offers significant potential given the challenges inherent in predicting financial market behavior. Additionally, advancements in computing technology and data availability have facilitated the widespread adoption of Transformer-based models by academic researchers seeking an edge in currency trading and investment strategies.
Advancements in technology alone are not a panacea, especially when considering the recent complexities in the FOREX market, such as the fluctuations caused by the COVID-19 pandemic and the Russia–Ukraine conflict. These events posed significant challenges for traders in predicting currency pair movements, as economic factors, sentiment, and geopolitical developments all played a significant role in shaping exchange rates.
Inspired by the success of Transformers in modeling sequential data in NLP, the same concept is employed in this study to forecast the evolution of exchange rates in the FOREX market. While recent studies have explored Transformer models primarily for trading strategy back-testing [3,4,5,6], our work focuses on applying Transformers directly to next-day closing price prediction in the FOREX market. To the best of our knowledge, this represents the first such application of Transformer architectures for this specific forecasting task in FOREX.
In their analysis, Fisher et al. [5] examine Transformer models with time embeddings for FX-Spot forecasting, comparing results with traditional models like LSTM for major currency pairs (EUR/USD, USD/JPY, GBP/USD) from November 2020 to January 2022. Their method includes both univariate and multivariate models, utilizing historical prices along with technical and fundamental data. Findings reveal that Transformers significantly outperform LSTM, demonstrating particular strength in noisy, high-frequency environments and proving effective for complex financial series.
Gradzki & Wojcik [
4] focus on high-frequency FOREX trading with Transformers, comparing them to ResNet-LSTM across six currency pairs and five time intervals (60 to 720 min). The study employs a Transformer architecture for forecasting, enhanced by technical analysis for improved accuracy. The findings indicate that Transformers slightly outperform ResNet-LSTM, especially in longer intervals (480, 720 min). However, transaction costs significantly impact performance in shorter intervals (e.g., 60 min), underscoring the necessity for realistic back-testing.
Exploring a Transformer Encoder model for minute-level FOREX trading, the authors of [6] focus specifically on EUR/USD and GBP/USD. The model integrates Exponential Moving Averages (EMAs) with varying smoothing factors to better capture price trends. Trained on data from July 2023, it achieves a cross-entropy loss below 0.2, indicating strong predictive accuracy. However, profitability is limited by high-frequency trading costs, as spreads can negate gains, demonstrating that real-world outcomes are significantly affected by transaction costs.
In a significant contribution, Kantoutsis et al. [3] present the Momentum Transformer, an attention-based deep learning model that outperforms traditional momentum and mean-reversion strategies as well as LSTM-based models. By leveraging attention mechanisms, it captures long-term dependencies and adapts to market shifts, such as those seen during the SARS-CoV-2 crisis. Back-testing from 1995 to 2020 reveals superior performance, particularly in recent years and during significant market events. While the hybrid Temporal Fusion Transformer (TFT) performed best overall, pure attention models also demonstrated strong performance. The study suggests an ensemble approach for improved results across asset classes and highlights the model’s robustness in commodities trading.
In our approach, we test the forecasting ability of our Transformer-based model, called EXPERT, on nine currency pairs—EUR/USD, AUD/CAD, EUR/AUD, EUR/CAD, GBP/AUD, NZD/USD, USD/JPY, USD/MXN, and BRL/USD—and evaluate it against six widely used forecasting models: the Stochastic Gradient Descent (SGD), the Bagging Regression (BGR), the Extreme Gradient Boosting (XGB), the Random Forest (RF), the Linear Regressor, and the Long Short-Term Memory (LSTM) models.
Each dataset for these nine currency pairs has been used individually with every forecasting model, following the classic training–testing scheme. The training set is utilized to fine-tune the parameters of each model, and the performance of the trained models is evaluated on the testing set. All models predict the closing price for the next day; the predicted values are then compared against the actual values.
This paper is organized as follows. Section 2 reviews related work, while Section 3 presents the collected dataset. Every aspect of the EXPERT model is analyzed in Section 4. The alternative forecasting models are briefly introduced in Section 5. Section 6 presents the evaluation metrics used. The forecasting performance of the EXPERT model against the competition is presented in Section 7. In the same section, we present the performance of the EXPERT model for larger forecasting horizons and evaluate its performance using the Diebold–Mariano (DM) test to compare its forecast accuracy with alternative methods, followed by the Multiple Comparisons with the Best method on five samples from our dataset. In Section 8, we evaluate the success of a Transformer-based automatic trading system against other similar systems, and in Section 9, we conclude this paper.
2. Related Work
A systematic review of the existing literature was conducted to provide a comprehensive understanding of machine learning models for exchange rate prediction [7,8,9,10].
Islam et al. [10] report improvements from a hybrid GRU–LSTM model that outperforms standalone GRU, LSTM, and simple moving average (SMA) models across multiple metrics, including MSE, RMSE, MAE, and $R^2$. Comparisons were made against these benchmarks to demonstrate the efficacy of the proposed model. The models were tested on historical foreign exchange data for four major currency pairs, EUR/USD, GBP/USD, USD/CAD, and USD/CHF, using a dataset that spans 1 January 2017 to 30 June 2020. The hybrid model predicted closing prices for these currency pairs at 10 and 30 min intervals, demonstrating superior predictive capability.
In their study on exchange rate prediction, Panda et al. [9] found that a hybrid GRU–LSTM model effectively predicts future closing prices in the FOREX market. Applied to major currency pairs (EUR/USD, GBP/USD, USD/CAD, USD/CHF), this model outperformed standalone GRU, LSTM, and simple moving average (SMA) models in MSE, RMSE, and MAE for 10 min intervals, and it excelled with GBP/USD and USD/CAD in 30 min intervals. It also achieved a higher $R^2$ score, indicating a lower prediction risk. Using a dataset of closing prices from 1 January 2017 to 30 June 2020, the model showed strong predictive capabilities, though it struggled during sudden price fluctuations. Future enhancements are planned, including applications to more currency pairs and shorter timeframes.
The findings of [8] emphasize the clear advantages of machine learning algorithms over traditional stochastic models in financial market forecasting. After surveying more than 150 relevant articles, the study demonstrates that machine learning algorithms generally outperform stochastic methods by effectively capturing nonlinear dynamics in financial time series across various asset classes and market geographies. Recurrent neural networks (RNNs) outperform feedforward neural networks and support vector machines, likely due to their ability to capture temporal dependencies.
The survey by Seze et al. [7] reports significant advancements in deep learning (DL) models for financial time series forecasting, showcasing their superiority over traditional machine learning approaches. Long Short-Term Memory (LSTM) networks are favored for their effectiveness in handling time-varying data and capturing temporal dependencies. More than half of the studies surveyed focus on recurrent neural networks (RNNs) for price trend predictions, while deep multilayer perceptrons (DMLPs) are often used for classification tasks. The survey also notes growing interest in deep reinforcement learning (RL) for algorithmic trading, which opens opportunities to integrate behavioral finance insights.
Fletcher [11] demonstrates that machine learning techniques can be effectively applied to forecast currency movements. Their findings indicate that it is possible to forecast the directional evolution (up, down, or within the bid–ask spread) of the EUR/USD pair between 5 and 200 s into the future, with accuracy rates ranging from 90% to 53%, respectively. Additionally, they have shown that it is feasible to predict price turning points for a basket of currencies in a way that can be profitably exploited.
Goncu [12] applied several machine learning regression methods—Ridge, decision tree, support vector, and Linear Regression—to predict monthly USD/TRY exchange rates. Key macroeconomic factors, such as domestic money supply, interest rates, and the prior month’s exchange rate, were used for prediction. Among the tested models, Ridge regression delivered the most accurate forecasts, with relative errors under 60 basis points. Out-of-sample back-testing over various time periods confirmed Ridge’s superior performance, suggesting it effectively balances accuracy and overfitting. The model can also support scenario analysis, helping policymakers and investors assess the impact of interest rate changes on exchange rates.
Research by Qi et al. [13] introduces event-driven features to improve FOREX trading predictions by identifying trend changes and retracement points for optimal trade entry. The authors tested deep learning models, including LSTM, BiLSTM, and GRU, against a baseline RNN, with GRU and BiLSTM outperforming the others across various currency pairs. The best model, a GRU with 60 time steps for EUR/GBP, achieved an RMSE of 1.50 × and a MAPE of 0.12%, surpassing previous studies. These findings show that the proposed models, combined with event-driven features, can provide accurate, low-risk trading strategies.
Islam and Hossain [14] proposed a more advanced model, introducing a network that combines a GRU with an LSTM for improved FOREX rate prediction.
Recent studies have increasingly applied advanced deep learning and adaptive learning strategies to improve forecasting accuracy in the FOREX market. Das et al. [15] proposed a deep learning-based framework for trend prediction using multiple LSTM variants, including Vanilla, Stacked, Bidirectional, CNN-LSTM, and ConvLSTM, to model short- and long-term price movements of INR-based currency pairs (GBP/INR, AUD/INR, USD/INR). The predicted trends were validated against traditional technical indicators such as ADX, ROC, momentum, CCI, and MACD, demonstrating the reliability of LSTM-based architectures in capturing market dynamics and identifying low-risk trading entry and exit points.
Addressing the limitations of static models, Bousbaa et al. [16] introduced a data stream mining (DSM) approach for financial time series forecasting, integrating online Stochastic Gradient Descent (SGD) with Particle Swarm Optimization (PSO) to adaptively learn from evolving FOREX data. Their model employed sliding windows that adjusted dynamically to changes in data stationarity, enabling it to effectively capture shifting market behaviors. The results showed that this adaptive DSM framework outperformed fixed window methods, achieving higher forecasting accuracy and greater robustness to concept drift.
Extending the application of deep learning models to volatility prediction, Zitis et al. [17] incorporated complexity measures, specifically the Hurst exponent and fuzzy entropy, into RNN, LSTM, and GRU models to enhance FOREX market volatility forecasting. Using intraday data from major currency pairs (EUR/USD, GBP/USD, USD/CAD, and USD/CHF), they found that including these complexity metrics significantly improved predictive accuracy, with LSTM and GRU models outperforming traditional RNNs. The study highlighted the potential of combining complexity-based features with deep architectures to enhance risk assessment and trading decisions.
Zhao et al. [18] evaluated Transformer-based architectures, namely, the original Transformer, Informer, and Temporal Fusion Transformer (TFT), for exchange rate prediction across four NZD currency pairs (NZD/USD, NZD/CNY, NZD/GBP, and NZD/AUD). The TFT achieved the highest predictive accuracy ($R^2$ up to 0.94) and the lowest RMSE and MAE, while the Informer demonstrated faster convergence owing to its sparse attention mechanism. Furthermore, integrating the VIX index into the TFT model further enhanced prediction accuracy. This work underscores the growing effectiveness of Transformer-based models in capturing complex temporal dependencies in exchange rate forecasting.
3. The Dataset
In this study, our objective is twofold: (a) to create a Transformer-based model (EXPERT) for forecasting exchange rates and (b) to test it against a set of well-known forecasting methodologies. To ascertain the overall most accurate forecasting model, we must test them on a rich and diverse dataset that includes exchange rates with different characteristics. Testing our models on multiple pairs helps to mitigate the risk of overfitting to the specific market characteristics present in a single currency pair, increasing the model’s reliability and applicability to real-world trading scenarios. The dataset was compiled using the MetaTrader application and comprises two major, five minor, and three exotic exchange rates, spanning weekdays from January 1999 to March 2022. Each entry in the dataset contains the open, high, low, and close values. In the following, we report the exact dimensions of every exchange rate dataset:
AUD/CAD: 12 February 2007 to 3 March 2022 (3895 records).
USD/JPY: 4 January 1999 to 3 March 2022 (5990 records).
EUR/AUD: 17 June 2004 to 3 March 2022 (4580 records).
EUR/CAD: 17 June 2004 to 3 March 2022 (4580 records).
EUR/USD: 4 January 1999 to 3 March 2022 (5990 records).
GBP/AUD: 21 August 2007 to 3 March 2022 (3760 records).
NZD/USD: 4 January 1999 to 3 March 2022 (5994 records).
USD/MXN: 23 July 2007 to 3 March 2022 (3668 records).
BRL/USD: 17 January 2000 to 21 March 2019 (5000 records).
The dataset was divided into training (68.8%), validation (17.2%), and testing (14%) subsets, with the first 86% of the data used for training and validation. The remaining 14% was used as testing (out-of-sample) data to evaluate the models’ ability to generalize to new, unseen data. This data-splitting strategy provided a robust evaluation of the model’s performance across both the training and out-of-sample data.
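For illustration, a minimal sketch of such a chronological split is given below, assuming arrays already ordered by date; the function name and signature are ours, not taken from the paper’s code.

```python
def chronological_split(X, y, train_frac=0.688, val_frac=0.172):
    """Split time-ordered arrays into train/validation/test subsets without shuffling."""
    n = len(X)
    i = int(n * train_frac)               # end of the training portion
    j = int(n * (train_frac + val_frac))  # end of the validation portion (~86% of the data)
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```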
The currency pairs’ exchange rates were categorized as major, minor, or exotic based on the European Securities and Markets Authority (ESMA). In this context, the currency pairs in our datasets are classified as follows:
Major currency pairs:
1. EUR/USD (euro/US dollar).
2. USD/JPY (US dollar/Japanese yen).
Minor currency pairs (cross-currency pairs):
3. EUR/AUD (euro/Australian dollar).
4. EUR/CAD (euro/Canadian dollar).
5. AUD/CAD (Australian dollar/Canadian dollar).
6. GBP/AUD (British pound/Australian dollar).
7. NZD/USD (New Zealand dollar/US dollar).
Exotic currency pairs:
8. USD/MXN (US dollar/Mexican peso).
9. BRL/USD (Brazilian real/US dollar).
Major currency pairs are the most widely traded globally. According to ESMA, they include any two of the following: US dollar (USD), euro (EUR), Japanese yen (JPY), British pound (GBP), and Canadian dollar (CAD). All other currencies are considered non-major. Exotic currency pairs involve one major currency and one currency from a smaller or emerging economy; they generally have higher volatility and higher spreads compared to major and minor pairs.
Table 1 provides a summary of the essential descriptive statistics for each currency pair, helping to capture the main characteristics of their price series: minimum and maximum value, mean and standard deviation, first ($Q_1$) and third ($Q_3$) quartile, skewness, and kurtosis.
For each currency, a dataset was compiled consisting of the open, high, low, and close values (open is the price at the start of the period, close is the price at the end of the period, high is the highest price traded during the period, and low is the lowest price traded during the period). All time series were normalized to the 0–1 range using the classic MinMax normalization. In every case, lagged values of the four time series were employed to forecast the closing price at the next time instance. The optimal lags for each currency pair were identified through an exhaustive trial-and-error search for lag values up to 20 and can be found in Table 2.
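As an illustration of this preprocessing step, the sketch below builds lagged OHLC windows and next-step closing-price targets using scikit-learn’s MinMaxScaler. The function and column names are hypothetical; in practice the scaler should be fitted on the training portion only, to avoid look-ahead bias.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_supervised(ohlc: pd.DataFrame, lags: int):
    """Turn an OHLC frame into (samples, lags, 4) windows and next-step close targets."""
    scaler = MinMaxScaler()  # classic MinMax scaling to the 0-1 range
    values = scaler.fit_transform(ohlc[["open", "high", "low", "close"]].values)
    X, y = [], []
    for t in range(lags, len(values)):
        X.append(values[t - lags:t])   # the preceding `lags` OHLC rows
        y.append(values[t, 3])         # scaled closing price at the next time instance
    return np.array(X), np.array(y), scaler
```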
4. The EXPERT Model
The Transformer model, first introduced by Vaswani et al. [2], revolutionized natural language processing (NLP) by outperforming recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in tasks like machine translation. This architecture’s key innovation is the self-attention mechanism, which enables the model to capture long-term dependencies and analyze input sequences more comprehensively.
In time series forecasting, decision-making processes across sectors like finance, retail, and industry often rely on multivariate time series data. Traditionally, statistical models such as Vector Autoregression (VAR) [19] and Autoregressive Moving Average (ARMA) [20] have been used to forecast time series data. Recently, machine learning and particularly deep learning models such as RNNs and CNNs have been explored for this purpose [21,22]. In parallel, the Transformer model has been tested as an alternative and promising approach for time series forecasting [23]. Our study follows the same methodological path.
4.1. Architecture Modifications and Network Structure
The overall architecture for time series forecasting draws inspiration from the Transformer Encoder structure but includes several adjustments, based on a combination of good practices in the literature and insights from studies like [23].
Self-attention and layer normalization: Each encoder block starts with layer normalization, a technique that stabilizes the training process by standardizing the input across each layer, ensuring numerical stability and faster convergence. This is followed by multihead self-attention, a mechanism that enables the model to focus on different parts of the input sequence simultaneously. Unlike traditional models that process sequences step by step, self-attention allows the model to capture diverse temporal patterns by analyzing all positions in the sequence at once. For time series data, the self-attention layers are adapted to capture relationships between time points, focusing on temporal dependencies rather than word-to-word associations, as seen in natural language processing (NLP) models.
Residual connections and feedforward networks: To maintain the flow of important information through the network and address the vanishing gradient problem—where gradients become too small during training, impeding learning—residual connections are employed. These connections add the input of a layer directly to its output, preserving information from earlier layers. After the self-attention step, a feedforward neural network (FFN) processes the output, enabling the model to learn complex, nonlinear relationships within the time series. The FFN often includes convolutional layers that scan over input data and ReLU (Rectified Linear Unit) activations, which introduce nonlinearity and help the model detect intricate temporal patterns. This combination of techniques draws from the original Transformer model by Vaswani et al. [2] and has been adapted for time series analysis [23].
Regularization techniques: To prevent overfitting, where the model performs well on training data but poorly on unseen data, dropout regularization is applied. Dropout randomly “turns off” a fraction of neurons during training, forcing the model to learn more robust features. This regularization is used both after the feedforward layers and within the multilayer perceptron (MLP) used for forecasting. By improving the model’s ability to generalize, these techniques enhance its robustness in handling unseen time series data.
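To make this block structure concrete, the following Keras sketch assembles one encoder block along these lines (pre-normalization, multihead self-attention with a residual connection, and a Conv1D feedforward sub-layer with ReLU and dropout). The layer arguments, kernel size of 1, and default rates are our assumptions rather than the authors’ exact implementation.

```python
from tensorflow.keras import layers

def encoder_block(x, head_size, num_heads, ff_dim, dropout=0.1):
    """One encoder block: layer norm, multihead self-attention, residual, Conv1D FFN."""
    # Sub-layer 1: normalization, self-attention over all time steps, residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                  dropout=dropout)(h, h)
    h = layers.Dropout(dropout)(h)
    res = x + h
    # Sub-layer 2: normalization, position-wise Conv1D feedforward network, residual connection
    h = layers.LayerNormalization(epsilon=1e-6)(res)
    h = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(h)
    h = layers.Dropout(dropout)(h)
    h = layers.Conv1D(filters=x.shape[-1], kernel_size=1)(h)
    return res + h
```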
A comprehensive discussion of these terms is provided in Section 4.4.
4.2. Output Processing and Forecasting
Once the input sequence has passed through the encoder blocks, the output is aggregated using global average pooling. This operation condenses the sequence into a fixed-length vector by summarizing information across all positions, making it easier for the model to focus on key features.
The pooled representation is then fed into an MLP for final processing. The MLP consists of a fully connected layer with Exponential Linear Unit (ELU) activation, followed by a linear output layer, which directly regresses the target values, making it suitable for continuous time series prediction. The linear activation function in the output layer is critical for regression tasks where the goal is to predict real-valued outputs.
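A minimal Keras sketch of this forecasting head, assuming a 256-unit hidden layer (the dropout value below is a placeholder):

```python
from tensorflow.keras import layers

def forecasting_head(encoded, mlp_units=256, mlp_dropout=0.3):
    """Pool the encoder states over time, apply an ELU dense layer, and regress one value."""
    pooled = layers.GlobalAveragePooling1D()(encoded)  # (batch, features)
    h = layers.Dense(mlp_units, activation="elu")(pooled)
    h = layers.Dropout(mlp_dropout)(h)
    return layers.Dense(1, activation="linear")(h)     # next-day closing price
```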
4.3. Unique Features of the EXPERT Architecture
The proposed EXPERT model builds upon the core concepts of the Transformer architecture but incorporates several key modifications to address the specific demands of time series forecasting, particularly in financial data. The original Transformer model, designed primarily for natural language processing (NLP) tasks, includes components such as positional encodings, a decoder, and causal masking that are unnecessary for time series forecasting. Below, we outline the unique aspects of our architecture, focusing on how the EXPERT model diverges from the standard Transformer model and why these changes are necessary for accurate forecasting.
4.3.1. No Positional Encoding
In the original Transformer, positional encoding is applied to account for the lack of inherent order information in input sequences. However, for time series data, where sequential order is inherent, this additional encoding is redundant. Therefore, the EXPERT model omits positional encodings, relying on the inherent structure of the time series to capture temporal relationships. This simplification reduces computational complexity while preserving the time-dependent characteristics of the data.
4.3.2. Encoder-Only Architecture
While the classical Transformer consists of both an encoder and a decoder, the EXPERT model uses an encoder-only structure, as forecasting tasks do not require output sequences to be generated (e.g., in machine translation). The encoder processes the historical time series data, and no decoder is necessary, as the output is a single future prediction rather than a sequence. This architectural choice focuses all learning capacity on extracting meaningful patterns from past data, which is critical for making accurate time series forecasts.
4.3.3. No Masking in Attention Mechanism
In NLP tasks, causal masking is applied in the decoder to ensure the model does not access future tokens when making predictions. However, since time series forecasting only involves predicting future values based on past data, the EXPERT model does not require masking. The attention mechanism is free to focus on any part of the input sequence, optimizing its ability to capture long-range dependencies and interactions within the historical data.
4.3.4. Global Average Pooling for Temporal Feature Aggregation
Our EXPERT model applies global average pooling (GAP) to aggregate the sequence of hidden states generated by the encoder. This aggregation provides a condensed representation of the entire time series, summarizing its overall trend and relevant features. GAP is well suited to time series tasks, as it reduces the sequence into a single feature vector that captures the most salient information for forecasting.
4.3.5. Use of Convolutional Layers in Feedforward Networks
While the original Transformer applies fully connected layers in the feedforward networks, the EXPERT model employs Conv1D layers to capture local temporal dependencies between adjacent time steps. Convolutional layers are more effective in extracting short-term patterns, which are crucial for tasks like financial forecasting where trends and relationships evolve over time. By using Conv1D, the EXPERT model is able to learn finer-grained local structures in the data while still maintaining the benefits of the multihead self-attention mechanism.
4.3.6. Customized Dropout Rates to Mitigate Overfitting
The EXPERT model incorporates custom dropout rates tailored to different layers of the architecture. Specifically, dropout is applied in both the encoder blocks and the multilayer perceptron (MLP) layers. This differentiation helps prevent overfitting, particularly when dealing with highly volatile financial time series data, where overfitting can lead to poor generalization performance. In contrast, the classical Transformer applies uniform dropout across layers, which may not be optimal for time series forecasting.
4.3.7. Data Normalization and Numerical Embedding
In contrast to the word embeddings used in NLP tasks, the EXPERT model applies MinMax scaling to normalize the numerical time series data. This normalization ensures that all input features are on the same scale, which is critical for stabilizing training and improving model performance when forecasting values that vary widely in magnitude. The use of this preprocessing step further highlights the model’s adaptation to the specific challenges posed by financial time series data.
These modifications demonstrate that the EXPERT model is uniquely optimized for time series forecasting, particularly for financial applications where patterns, trends, and long-range dependencies must be carefully captured. The encoder-only architecture and adjustments to the attention and feedforward layers enable the model to make accurate predictions of future exchange rates based on historical data.
4.4. The EXPERT Architecture
Initially, the EXPERT model receives input data in the form of historical currency exchange rate values. Each input sequence represents historical exchange rates for a specific currency pair over a period of time. This input data is passed through an embedding layer, converting it into numerical vectors that the model can understand. The embedded input sequences are then passed through multiple Transformer Encoder layers, each consisting of multihead self-attention mechanisms and feedforward neural networks. (“Self-attention” refers to the mechanism that allows the model to weigh the importance of different elements in the input sequence when computing a representation of that sequence. Each element can attend to other elements to capture dependencies and contextual relationships within the sequence. “Multihead” indicates that the self-attention mechanism is executed multiple times (in parallel but independently) with different sets of learned parameters (heads). Each head has access to different representation subspaces, allowing the model to capture diverse patterns or relationships in the data. The results from these multiple heads are then concatenated and combined to form a comprehensive representation of the input sequence.) After processing through these layers, the model generates output sequences.
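For reference, the scaled dot-product and multihead attention computations introduced by Vaswani et al. [2], which this mechanism follows, can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad \mathrm{head}_i = \mathrm{Attention}\big(X W_i^{Q},\, X W_i^{K},\, X W_i^{V}\big),$$

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},$$

where $Q$, $K$, and $V$ are the query, key, and value projections of the input sequence $X$, $d_k$ is the key dimension, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices.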
The model is trained using historical currency financial data, adjusting its parameters to minimize the difference between its predictions and the actual data. Once the model is trained, it attempts to predict the next day’s closing price for the exchange rates mentioned in this paper.
The EXPERT architecture consists of the following components (see Figure 1).
4.4.1. Embedding Layer
The embedding layer initiates with data normalization, a technique essential for stabilizing the training process by standardizing the input data. This is achieved using layer normalization, which normalizes the input data as follows:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2,$$

where $\mu$ and $\sigma^2$ are the mean and variance of the input features, $\epsilon$ is a small constant for numerical stability, and $d$ is the dimension of the input features.
4.4.2. Encoder
The core of the EXPERT model is the encoder, which takes the input time series data and transforms it into a sequence of hidden states. This is performed using a stack of encoder blocks. Each encoder block consists of two sublayers:
Self-attention layer: This layer allows the model to learn long-range dependencies in the input data. A key part of this is the residual connection, which ensures that the input is passed forward while also allowing the model to learn relationships:

$$x' = x + \mathrm{MultiHead}\big(\mathrm{LayerNorm}(x)\big).$$

Feedforward network: The feedforward network allows the model to learn nonlinear relationships between the input data and the output value. Mathematically, this is computed as

$$\mathrm{FFN}(x') = \mathrm{ReLU}(x' W_1 + b_1)\, W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d \times f}$, $b_1 \in \mathbb{R}^{f}$, $W_2 \in \mathbb{R}^{f \times d}$, and $b_2 \in \mathbb{R}^{d}$. Here, $d$ represents the dimensionality of the input and output of the FFN; it is typically the hidden size of the input sequence after processing by the encoder and remains consistent throughout the architecture. $f$ represents the dimensionality of the intermediate layer within the FFN; this is usually larger than $d$, providing the network with a greater capacity to model complex transformations.
4.4.3. Global Average Pooling
The global average pooling layer takes the sequence of hidden states from the encoder and converts it into a single vector. This vector represents the overall trend of the input time series data. The operation is defined as

$$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i,$$

where $h_i$ is the hidden state at position $i$ and $n$ is the sequence length. The pooling operation compresses the sequence, allowing the model to focus on global trends.
4.4.4. Multilayer Perceptron (MLP)
The MLP takes the output of the global average pooling layer as input and predicts the output value. Each layer of the MLP performs the following transformation:

$$z^{(l+1)} = \mathrm{ELU}\big(W^{(l)} z^{(l)} + b^{(l)}\big),$$

where $W^{(l)}$ and $b^{(l)}$ are the weights and biases of layer $l$, and ELU is the activation function applied at each layer.
4.4.5. Output Layer
The final output layer applies a linear transformation to predict the output value. This is represented by the following equation:

$$\hat{y} = W_o z + b_o,$$

where $W_o$ and $b_o$ are the weights and biases of the output layer.
4.4.6. Encoder-Based Model—Hyperparameter Values
The optimization of the benchmarks in this study was focused on hyperparameter tuning to improve model performance. Key hyperparameters considered for optimization included the number of attention heads, feedforward network dimension, the number of Transformer blocks, learning rate scheduling, and dropout rates. These parameters were chosen due to their significant impact on the performance of Transformer-based models.
The corresponding value ranges for each hyperparameter are delineated as follows:
Number of attention heads: Tested values ranged from 2 to 60 heads.
Feedforward network dimension: It varied between 128 and 1024 units.
Number of Transformer blocks: The number of Transformer layers varied between two and seven blocks.
Learning rate: A custom learning rate scheduler was implemented to progressively increase the learning rate during the initial warm-up period (30 epochs), followed by gradual decay over 100 epochs. The base learning rate was set at , with a minimum learning rate of .
Dropout rates: Applied to both the Transformer layers and the fully connected layers, dropout rates ranged between 0.1 and 0.48 to prevent overfitting.
The search for optimal hyperparameters was carried out using a combination of grid search and manual tuning, informed by early experimental results and prior knowledge. Grid search was employed for discrete parameters such as the number of attention heads and Transformer blocks, while a more manual approach was applied for parameters like learning rate and dropout, as these often required finer control during training iterations. To find the most effective lag length, we gradually increased the number of lags from 0 to 30, noticing that performance improved up to a certain point. Beyond that, adding more lags made the results worse.
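A minimal sketch of such a warm-up-then-decay schedule as a Keras callback is shown below; the base and minimum learning rates and the cosine decay shape are placeholders, since the exact values and decay form used by the authors are not reproduced here.

```python
import numpy as np
import tensorflow as tf

BASE_LR, MIN_LR = 1e-3, 1e-5          # placeholder values, not the paper's tuned rates
WARMUP_EPOCHS, TOTAL_EPOCHS = 30, 100

def warmup_then_decay(epoch, lr):
    """Linear warm-up to BASE_LR over the first epochs, then cosine decay toward MIN_LR."""
    if epoch < WARMUP_EPOCHS:
        return float(BASE_LR * (epoch + 1) / WARMUP_EPOCHS)
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return float(MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + np.cos(np.pi * progress)))

lr_schedule = tf.keras.callbacks.LearningRateScheduler(warmup_then_decay)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=TOTAL_EPOCHS, callbacks=[lr_schedule, early_stop])
```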
The implementation leveraged the TensorFlow and Keras libraries for deep learning, with specific reliance on Keras’s Sequential API and the MultiHeadAttention and Conv1D layers for constructing EXPERT model architecture. The MinMaxScaler from scikit-learn was used for feature scaling, and early stopping with learning rate scheduling was implemented using Keras callbacks to prevent overfitting and optimize the learning process.
Table 2 presents the hyperparameter configurations used for different currency pairs in our EXPERT models. While the general structure of the model remains consistent across different setups, key parameters such as the number of attention heads, feedforward dimension, and batch size vary depending on the specific dataset. The proposed architecture consists of multiple stacked Transformer blocks, each incorporating a self-attention mechanism with varying head sizes and feedforward layers. The MLP layer contains 256 units for all currency pairs. To prevent overfitting, dropout is applied to both the MLP layers and the encoder blocks, with an MLP dropout rate varying per experiment, while the encoder dropout remains fixed at 0.1 for all cases. A global average pooling layer aggregates sequence-level representations, which are subsequently passed through fully connected layers to generate the final forecasting output. The model is trained using adaptive optimizers, such as ADAM and ADAMW, with batch sizes optimized for each currency pair to ensure robust performance.
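Putting these pieces together, the snippet below sketches how such an encoder-only model could be assembled and compiled in Keras; all hyperparameter values shown are illustrative defaults, not the tuned per-pair settings reported in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_expert(lags, n_features=4, head_size=32, num_heads=4,
                 ff_dim=256, num_blocks=4, mlp_units=256, mlp_dropout=0.3):
    """Assemble an encoder-only Transformer for next-day closing price regression."""
    inputs = tf.keras.Input(shape=(lags, n_features))
    x = inputs
    for _ in range(num_blocks):                         # stacked encoder blocks
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                      dropout=0.1)(h, h)
        res = x + h
        h = layers.LayerNormalization(epsilon=1e-6)(res)
        h = layers.Conv1D(ff_dim, kernel_size=1, activation="relu")(h)
        h = layers.Dropout(0.1)(h)
        h = layers.Conv1D(n_features, kernel_size=1)(h)
        x = res + h
    x = layers.GlobalAveragePooling1D()(x)              # aggregate over the time axis
    x = layers.Dense(mlp_units, activation="elu")(x)
    x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(1, activation="linear")(x)   # next-day closing price
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse", metrics=["mae"])
    return model

# Example usage: model = build_expert(lags=10); then fit with the callbacks sketched above.
```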
4.5. Rationale for Choosing the EXPERT Model
Our model is designed to forecast currency exchange rates by capturing both broad long-term trends and short-term fluctuations that shape financial data. A key component is the self-attention mechanism, which enables the model to analyze the entire historical data sequence, identifying how past events may continue to influence current market behavior. At the same time, the Conv1D layers focus on local details, such as the daily price movements typical in exchange rate series. Next, the global average pooling layer summarizes the extracted features into a concise representation that reflects the overall data direction. Lastly, custom dropout rates are applied during training to enhance the model’s adaptability and robustness when encountering new, unseen data.
9. Conclusions
In this study, we introduced EXPERT (EXchange rate Prediction using Encoder Representation from Transformers), a Transformer-based framework for forecasting foreign exchange rates. By adapting the encoder-only Transformer architecture to time series data, EXPERT effectively captures both long-term dependencies and short-term dynamics inherent in financial series.
Empirical evaluation across nine major, minor, and exotic currency pairs over more than two decades (1999–2022) demonstrated that EXPERT consistently outperforms traditional econometric, ensemble, and deep learning benchmarks—including random walk, Linear Regression, Random Forest, XGBoost, and LSTM—on various performance metrics (namely, MAPE, MAE, and MSE). Its forecasting advantage is statistically supported by Diebold–Mariano tests, confirming its robustness and reliability across diverse market conditions.
From an academic perspective, this proposal contributes to the literature on attention-based deep learning for time series forecasting. It shows that the encoder-only Transformer is able to model temporal dependencies through self-attention rather than sequential recurrence and can outperform recurrent architectures such as LSTM in terms of accuracy and efficiency. The proposed design choices pave the way for applying Transformer architectures to other time series domains.
From a practical standpoint, EXPERT offers a scalable and data-efficient forecasting tool suitable for algorithmic trading, portfolio risk management, and policy analysis. Its high accuracy, adaptability to longer horizons, and consistent performance across currency types make it valuable to both financial institutions and decision-makers managing exchange rate exposure.
Future research could extend this work by incorporating macroeconomic indicators, textual or sentiment data, and probabilistic forecasting.