1. Introduction
Time series play a crucial role in capturing data variation over time, a feature that is especially important in industrial and financial analyses [1]. The advent of artificial intelligence (AI) has transformed time series analysis and forecasting, enabling more accurate and efficient predictions across diverse domains [2]. By leveraging machine learning (ML) and deep learning (DL) techniques, AI can model complex temporal dependencies and patterns within data. These methods are widely applied in areas such as stock price forecasting, engineering problem-solving, and industrial process optimization [3]. The capacity of AI to process large datasets and learn nonlinear relationships has made it a powerful and reliable tool for identifying trends and anomalies in time series data, ultimately supporting more robust decision-making and operational efficiency [4].
By nature, some time-series data exhibit high levels of fluctuation, making them difficult to analyze and model. Large variations and irregular patterns can introduce significant and unpredictable changes, complicating the modeling process [5]. Additionally, time-series data often contain outliers, abrupt shifts, or inconsistent patterns across observations. These irregularities can arise from various sources such as external shocks, measurement errors, or structural changes in the underlying system. Moreover, time-series data typically involve strong temporal dependencies, where current values are influenced by historical observations in complex ways. This temporal structure must be carefully captured to ensure accurate forecasting. Given these challenges, any model proposed for time-series prediction must be capable of identifying underlying patterns in the data while effectively handling noise, irregularities, and nonstationary behavior [6]. Traditional statistical models often fall short in addressing such complexity, highlighting the need for more flexible and adaptive approaches such as ML and DL, which can learn from data directly without relying on rigid assumptions [7].
Many approaches have been proposed for forecasting time series using ML and DL models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and, more recently, Transformers. RNNs were a major research focus during the 1990s [3,8], developed specifically to process and analyze sequential data such as time series and natural language. Unlike traditional feedforward neural networks, RNNs maintain a form of memory by passing information from one hidden state to the next, enabling the network to capture temporal dependencies across time steps. As a result, the output of an RNN is influenced not only by the current input but also by the sequence of prior inputs. This ability to model sequential relationships allows RNNs to capture trends and complex patterns in time-series data, making them a powerful tool for forecasting tasks [3].
However, RNN models suffer from issues such as exploding and vanishing gradients [9], which hinder their ability to learn long-term dependencies. To mitigate these problems, techniques like gradient clipping have been proposed [10]. A more robust solution emerged with LSTMs, which were specifically designed to address the limitations of standard RNNs [11]. In conventional RNNs, gradients can either become very small (vanish) or grow excessively large (explode) during backpropagation through time, making training unstable or ineffective. LSTMs introduced gating mechanisms, namely the forget, input, and output gates, that control the flow of information through the network. These gates enable the model to retain or discard information selectively at each time step. In addition to the hidden state, LSTMs use a cell state, which acts as a memory buffer that helps preserve relevant historical information over long sequences. This architecture allows LSTMs to effectively capture long-term dependencies, making them more efficient and reliable for time-series forecasting compared to traditional RNNs [12].
Although LSTMs are effective at capturing long-term dependencies, their complex architecture, consisting of three gates (input, output, and forget) and a memory cell, makes them computationally expensive. To address this issue, Bengio et al. [13] introduced an RNN Encoder–Decoder architecture, which later evolved into the GRU. GRUs simplify the LSTM structure by using only two gates: the update gate, which determines how much information should be carried forward, and the reset gate, which controls how much past information should be forgotten. These two gates regulate the flow of information while maintaining a simpler structure, resulting in reduced computational cost compared to LSTMs [14]. Numerous studies comparing GRU and LSTM models have shown that GRUs often achieve similar or even superior performance to LSTMs [15,16], making them an efficient and competitive choice for sequence modeling tasks such as time-series forecasting [17].
RNNs, LSTMs, and GRUs process data sequentially: inputs are handled one time step at a time, with each step depending on the hidden state from the previous time step. This structure enables the models to effectively capture temporal dependencies within the data. However, the sequential nature of these models also limits their efficiency, as each computation must wait for the previous one to complete, leading to longer training times and limited parallelization. In contrast, Transformers represent a paradigm shift in sequence modeling. Introduced by Vaswani et al. [18], Transformers eliminate the need for sequential processing by leveraging a self-attention mechanism that captures dependencies between all elements in a sequence simultaneously. This architecture allows for efficient parallel computation and significantly improves the ability to model both short- and long-term dependencies. To compensate for the absence of recurrence, Transformers incorporate positional encoding, which provides information about the order of elements in the sequence.
The Transformer architecture was initially introduced for Natural Language Processing (NLP) tasks [19], but it has since demonstrated remarkable adaptability across a wide range of domains. In recent years, the potential of Transformers for time-series prediction has gained significant attention, largely due to their ability to effectively model long-range dependencies. Several studies have highlighted the superiority of Transformers in capturing complex temporal patterns. For example, Lim et al. [20] proposed a novel DL architecture called the Temporal Fusion Transformer (TFT), designed to improve both the accuracy and interpretability of multi-horizon time-series forecasting. A key innovation in TFT is the introduction of a Gated Residual Network (GRN), a flexible gating mechanism that allows the model to dynamically switch between linear and nonlinear processing pathways. To evaluate its performance, TFT was tested on multiple datasets, including traffic [21], electricity [22], retail, and financial volatility forecasting datasets, and consistently outperformed other benchmark models across all four domains.
In another study, Zhou et al. [23] introduced a novel Transformer-based model called Informer, specifically designed for long-sequence time-series forecasting (LSTF). The model addresses several inefficiencies in traditional Transformer architectures, including scalability challenges, the complexity of the self-attention mechanism, high memory consumption, and limitations of the encoder–decoder structure. To overcome these issues, Informer introduces a ProbSparse self-attention mechanism, which selects only the most informative queries for attention computation, significantly reducing computational overhead. Additionally, Informer employs a distilling operation that compresses repetitive temporal patterns, allowing the model to focus more effectively on critical patterns during training. These innovations make Informer a highly efficient and scalable solution for LSTF tasks. The model’s superiority was demonstrated across various benchmark datasets, positioning Informer as a powerful tool for time-series prediction capable of capturing complex long-range temporal dependencies more effectively than traditional models.
LSTMs, GRUs, and Transformers have gained significant popularity for time-series prediction due to their ability to capture complex temporal dependencies and patterns. However, achieving optimal performance involves more than just the choice of model architecture. It also heavily relies on selecting an appropriate loss function. Loss functions play a crucial role in training by quantifying the discrepancy between predicted and actual values, guiding the model to minimize errors during backpropagation. While commonly used loss functions such as Mean Absolute Error (MAE) and Mean Squared Error (MSE) are effective in many situations, they may underperform in cases involving highly volatile data or abrupt changes over time. To better handle such challenges, researchers often design customized loss functions tailored to the specific complexities of their prediction tasks. These specialized functions can enhance model sensitivity to sudden variations and improve overall predictive accuracy in dynamic and irregular time-series environments.
One notable example of a customized loss function is the quantile regression loss function, also known as the pinball loss function, proposed by Koenker and Bassett [24]. This loss function asymmetrically penalizes underestimations or overestimations depending on the quantile of interest, enabling the model to focus on specific parts of the target distribution, such as the median, lower, or upper tails. Wang et al. [25] leveraged the pinball loss function in developing the Probabilistic Quantile Multiple Fourier Feature Network (QMFFNet), designed to forecast the temperature of Qinghai Lake. Given the inherent volatility of meteorological time series, conventional models often struggle to maintain accuracy. By incorporating the pinball loss, QMFFNet effectively captured the stochastic nature of the data. Experimental results demonstrated that QMFFNet outperformed baseline models, including MLP, LSTM, and GRU, highlighting the effectiveness of the pinball loss in managing complex, variable forecasting tasks. In a separate study, Kang et al. [26] applied the pinball loss function to the domain of electrical load forecasting. This work compared the performance of LSTM, RNN, and Gradient Boosting Regression Tree (GBRT) models. Among them, the LSTM model guided by the pinball loss achieved the highest predictive accuracy. Beyond improving forecast precision, this approach also enabled the generation of probabilistic forecasts, offering valuable insights into potential fluctuations in future electricity demand.
To further address the limitation of predicting peaks in high-variability time-series datasets through ML and DL with traditional loss functions such as MSE and MAE, this study introduces a novel loss function: the enhanced peak (EP) loss function. The EP loss function is specifically designed to reduce underestimation and overestimation errors at extreme values, thereby improving the accuracy of peak predictions. To assess the effectiveness of the EP loss function, a comprehensive experimental evaluation was conducted, benchmarking its performance against the pinball loss function and MSE, which serves as the baseline. Three complex and diverse time-series datasets were used in the evaluation: (i) a NOx emission dataset tracking nitrogen oxide (NOx = NO + NO2) emissions in a forested region of Iowa; (ii) a streamflow dataset comprising streamflow and precipitation measurements from the Iowa River Basin; and (iii) a financial dataset containing gold price data spanning from 1 January 2008 to 31 December 2023. Two ML models were employed: a GRU model for the NOx emissions dataset and a Transformer model for both the streamflow and gold price datasets. This multi-dataset, multi-model evaluation framework provides a robust assessment of EP’s generalizability and performance across varied time-series forecasting tasks.
The primary contribution of this study is the introduction of a novel loss function, specifically designed to enhance forecasting accuracy at peak values in high-variability time series datasets. Unlike the pinball loss function, which targets a specific quantile of the data distribution, the proposed EP loss function applies additional penalties to errors that exceed a defined threshold, thereby focusing more directly on peak prediction performance. Another distinguishing feature of the EP loss function is its separate treatment of underestimations and overestimations at peak values, regions where the largest discrepancies between predicted and actual values typically occur. By incorporating distinct penalty terms for over- and under-predictions, the EP loss function allows for asymmetrical error handling, offering more flexibility and precision than traditional symmetric loss functions. This threshold-based formulation enables the EP loss function to more effectively capture and penalize critical prediction errors, especially at extreme values. Empirical results from this study demonstrate that the EP loss function outperforms the pinball loss function in forecasting peaks. The efficacy of this new loss function is validated using real-world datasets, highlighting its potential as a robust tool for time series forecasting tasks involving abrupt changes and extreme values.
This paper is organized as follows. Section 2 introduces the GRU and Transformer models utilized in this study, along with a detailed description of the proposed EP loss function and the benchmark pinball loss function. Section 3 presents the datasets employed in this research, followed by model performance assessment. Finally, Section 4 concludes the paper and outlines directions for future work.
2. Methodology
This section outlines the methodology employed in this research. To address the limitations of RNNs in capturing long-term dependencies, this study utilizes GRU and Transformer models, two widely used approaches for time series analysis. The GRU model is applied to the NOx emission dataset, while the Transformer model is used for streamflow and financial datasets.
GRU models, which employ two gating mechanisms, can effectively capture both short- and long-term dependencies while being less computationally expensive than LSTMs. This efficiency makes GRUs particularly well-suited for time series prediction tasks, where computational resources and model simplicity are often key considerations. Transformers, on the other hand, represent a more recent innovation in time series analysis. Originally introduced for language modeling, Transformers differ from earlier models in their ability to process entire sequences simultaneously and prioritize inputs using an attention mechanism, without inherently considering sequential order. This characteristic gives Transformers a significant advantage over traditional models, particularly for tasks that require identifying complex dependencies. However, the sequential order of data is critical in time series analysis. To address this, positional encoding is incorporated into Transformer models, allowing them to preserve the sequential structure of the data while leveraging the benefit of self-attention.
To compare the proposed EP loss function with existing custom loss functions, namely the pinball and Huber loss functions, the conventional MSE and MAE losses are used as baselines. Since only regression problems are considered in this study, the primary evaluation metrics are the coefficient of determination, denoted as the $R^2$ score, and the mean absolute percentage error (MAPE).
2.1. GRU
The GRU model was introduced by Bengio et al. [9]. Like the LSTM model, GRUs control the flow of information using gates; however, GRUs are simpler because they use one fewer gate than LSTMs, omitting the output gate. The two gates in a GRU are the reset gate and the update gate. With fewer gates, GRUs require fewer parameters, making them computationally more efficient and easier to train. The basic architecture of a GRU cell is shown in Figure 1. The typical mathematical formulation of a GRU is as follows:

$$z_t = \sigma\left(W_z x_t + R_z h_{t-1} + b_z\right)$$

$$r_t = \sigma\left(W_r x_t + R_r h_{t-1} + b_r\right)$$

$$\tilde{h}_t = \tanh\left(W_h x_t + R_h \left(r_t \odot h_{t-1}\right) + b_h\right)$$

$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t$$

where $z_t$ represents the update gate at time step $t$, $r_t$ is the reset gate vector, $\tilde{h}_t$ denotes the candidate activation vector, and $h_t$ is the final hidden state at time step $t$. Furthermore, $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent functions, $\odot$ denotes element-wise multiplication, $W$ and $R$ are weight matrices, and $b_z$, $b_r$, and $b_h$ are bias vectors. The inputs to the GRU cell at time step $t$ are $x_t$ and $h_{t-1}$, while the output is $h_t$, which is fed to the next cell as input.
GRUs, as RNN-based models, compute hidden states sequentially, with each state depending on the previous one, $h_{t-1}$. This step-by-step processing enables GRUs to model temporal dependencies; however, it also limits their ability to capture very long-range relationships due to constrained memory capacity.
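To make the gate interactions concrete, the following is a minimal pure-Python sketch of a single GRU step. It assumes one common convention, in which the update gate weights the previous hidden state and the reset gate is applied to the previous hidden state before the recurrent multiplication; other references swap these roles. The function name `gru_step` and the parameter keys (`W_z`, `R_z`, `b_z`, and so on) are illustrative assumptions, not names taken from the study's code.

```python
import math


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(M, v):
    """Matrix (list of rows) times vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def gru_step(x_t, h_prev, p):
    """One GRU step. `p` is a dict of hypothetical parameter names:
    W_* act on the input, R_* on the previous hidden state, b_* are biases."""
    z = sigmoid(add(matvec(p["W_z"], x_t), matvec(p["R_z"], h_prev), p["b_z"]))
    r = sigmoid(add(matvec(p["W_r"], x_t), matvec(p["R_r"], h_prev), p["b_r"]))
    # Reset gate scales the previous state before the recurrent weights (assumed convention).
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    pre = add(matvec(p["W_h"], x_t), matvec(p["R_h"], gated), p["b_h"])
    h_cand = [math.tanh(a) for a in pre]
    # Update gate z keeps a share of the old state; (1 - z) admits the candidate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
```

Because each call needs the hidden state returned by the previous call, a sequence has to be processed in a loop, which is the sequential bottleneck discussed above.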
2.2. Transformers
To address the aforementioned challenges, Vaswani et al. proposed a novel model called the Transformer [18]. By eliminating recurrence and instead leveraging an attention mechanism, Transformers can capture global dependencies across input and output sequences while enabling parallel computation. The attention mechanism is the core component that allows the model to dynamically focus on different parts of the sequence. This design significantly improves efficiency and scalability, especially for long sequences. The Query, Key, and Value components ($Q$, $K$, and $V$) are calculated by multiplying the input by $W^Q$, $W^K$, and $W^V$, respectively, where $W^Q$, $W^K$, and $W^V$ are learned weight matrices. The attention mechanism scores the relevance between the queries and keys with a scaled dot product, and the scores are passed through the softmax function to produce probabilities. The output, which is the weighted sum of the values, is given in Equation (7):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{7}$$

where $d_k$ is the dimensionality of the keys. Figure 2 shows the architecture of the Transformer model.
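The attention computation can be sketched in a few lines of pure Python. This is a minimal illustration of scaled dot-product attention for small list-based matrices, assuming the queries, keys, and values have already been projected; the function names are illustrative and masking, batching, and learned projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of rows).
    Returns (output rows, attention weights)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Relevance of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the weighted sum of the value rows.
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
           for w_row in weights]
    return out, weights
```

Every query row is handled independently of the others, which is what allows all positions of a sequence to be processed in parallel.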
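Additionally, the Transformer employs multi-head attention (Equation (8)), which improves the model’s ability to focus on different parts of the input sequence. Each of the $h$ heads is calculated using the attention equation, and the outputs of all heads are then concatenated. Equation (9) defines the Feed-Forward Network, which is a crucial part of the Transformer encoder and decoder. It follows the multi-head attention and consists of two fully connected layers with a ReLU activation function between them. It is applied independently to each position and helps capture non-linear relationships in the data. Equation (10), the residual connection followed by layer normalization, facilitates stable gradient flow during backpropagation and helps preserve the original input information.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^O, \quad \mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right) \tag{8}$$

$$\mathrm{FFN}(x) = \max\left(0, x W_1 + b_1\right) W_2 + b_2 \tag{9}$$

$$\mathrm{Output} = \mathrm{LayerNorm}\left(x + \mathrm{Sublayer}(x)\right) \tag{10}$$

where $W^O$ is the output projection matrix, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, and $b_1$ and $b_2$ are their bias vectors.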
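As an illustration of the position-wise feed-forward sublayer, the sketch below applies two fully connected layers with a ReLU in between to a single position vector, then adds the input back and normalizes. It assumes that Equation (10) refers to a residual connection followed by layer normalization, as in the original Transformer design, and it omits the learnable gain and bias of layer normalization. All function names are illustrative.

```python
import math


def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def feed_forward(x, W1, b1, W2, b2):
    """Two fully connected layers with ReLU in between, for one position."""
    hidden = [max(0.0, h + b) for h, b in zip(vecmat(x, W1), b1)]
    return [o + b for o, b in zip(vecmat(hidden, W2), b2)]


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn_sublayer(x, W1, b1, W2, b2):
    """Feed-forward output added to its input (residual), then layer-normalized."""
    out = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + oi for xi, oi in zip(x, out)])
```

In a full model the same weights are reused for every position in the sequence, so `ffn_sublayer` would simply be mapped over the rows of the attention output.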
Transformers process input tokens in parallel, which means that they do not inherently account for order. This becomes problematic when order is important, as in time series. Therefore, in addition to the equations above, positional encoding is also used in the model. Positional encoding provides position information that enables the model to differentiate tokens based on their positions, ensuring that order-sensitive tasks, such as time series forecasting, are handled effectively. In the sinusoidal positional encoding, each position in the sequence is transformed into a vector using sine and cosine functions computed at multiple frequencies. This design enables the model to capture both absolute and relative positional information. The positional encoding is defined in Equations (12) and (13), where $pos$ denotes the position of the token in the sequence, $i$ represents the index of the vector dimension, and $d_{model}$ indicates the dimensionality of the embedding.

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{12}$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{13}$$
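The sinusoidal scheme can be sketched as follows. This minimal example assumes the standard base of 10000 from the original Transformer formulation; the function name `positional_encoding` is illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal positional encodings.
    Even columns hold sines, odd columns hold cosines of the same angle."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(0, d_model, 2):  # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe
```

The resulting table is added element-wise to the input embeddings, so each time step carries a distinct signature that the attention layers can use to recover order.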
2.3. Loss Functions
One of the key challenges in time series forecasting is making accurate predictions over data with high variability. Such data often exhibit sudden, large, and frequent changes over time, which complicates analysis and modeling. In these cases, adopting custom loss functions specifically designed to handle high-variability sequences can lead to significant improvements in performance. The pinball loss function [24] is one such approach, introduced to address this issue. By incorporating a quantile parameter, the pinball loss targets specific quantiles of the data distribution, making it especially useful for quantile regression. Equation (14) presents the mathematical formulation of the pinball loss function:

$$L_q(y, Y) = \frac{1}{N} \sum_{i=1}^{N} \max\left(q \left(y_i - Y_i\right),\ (q - 1)\left(y_i - Y_i\right)\right) \tag{14}$$

where $y$ is the actual value, $Y$ is the value predicted by the model, $q$ is the quantile parameter supplied to the model as an input, and $N$ is the number of samples.
The performance of the pinball loss function depends on the specified quantile. For instance, when $q = 0.5$, the loss behaves symmetrically and emphasizes minimizing the absolute error. However, if $q > 0.5$, the focus shifts toward penalizing underestimations more heavily while giving less attention to overestimations. In contrast, the proposed EP loss function addresses underestimations and overestimations separately, allowing the model to apply more targeted penalization, particularly around peak values.
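The quantile-dependent asymmetry described above can be checked with a short sketch of the pinball loss in its standard form. The function name `pinball_loss` is illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile q in (0, 1).
    Underestimation (y > Y) is weighted by q, overestimation by (1 - q)."""
    total = 0.0
    for y, Y in zip(y_true, y_pred):
        diff = y - Y
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)
```

For example, with `q = 0.9` an underestimate of two units costs nine times as much as an overestimate of the same size, while `q = 0.5` yields half of the mean absolute error.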
To assess how well the EP loss function performs on peak events, we also included the Huber loss function [27] in our systematic comparison. The Huber loss function provides a balance between the sensitivity of the Mean Squared Error (MSE) to large errors and the robustness of the Mean Absolute Error (MAE) to outliers. It behaves quadratically for small residuals and linearly for large ones, controlled by a threshold parameter $\delta$. This makes the Huber loss less sensitive to outliers than MSE while still maintaining smooth gradients, improving optimization stability. It is widely used in regression and time-series forecasting, where occasional extreme errors should not dominate the loss landscape. Equation (15) represents the Huber loss function:

$$L_\delta(y, Y) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \frac{1}{2}\left(y_i - Y_i\right)^2, & \left|y_i - Y_i\right| \le \delta \\ \delta\left(\left|y_i - Y_i\right| - \frac{1}{2}\delta\right), & \text{otherwise} \end{cases} \tag{15}$$
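A minimal sketch of the Huber loss in its standard form is given below; the function name and the default value of `delta` are illustrative choices, not settings taken from the experiments.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for residuals up to delta, linear beyond it."""
    total = 0.0
    for y, Y in zip(y_true, y_pred):
        r = abs(y - Y)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(y_true)
```

Note that the penalty depends only on the size of the residual, so an underestimate and an overestimate of equal magnitude receive the same loss.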
Unlike the Huber loss, whose piecewise behavior is determined solely by the magnitude of the residual, EP introduces a value-dependent asymmetry that becomes active specifically during peak regions of the target signal. This allows EP to apply proportional penalties for underestimation as the true value increases, ensuring that the model gives appropriate emphasis to high-intensity events that are often the most important in highly variable time series. Furthermore, EP includes a dedicated overestimation term that enables controlled penalization of overestimation during peak events as well, providing flexible and balanced treatment of both types of errors in critical regions of the signal.
The EP loss incorporates three parameters: a threshold, which determines when penalization begins, and two penalization factors, which control the intensity of penalties for under- and overestimation. These two penalties, which we call the underestimation factor and the overestimation factor, are activated when the model encounters the corresponding scenario. Specifically, the underestimation term is triggered when the predicted value is lower than the actual value, and the overestimation term is activated in the opposite case. This conditional structure enables the loss function to penalize errors in a direction-aware manner and encourages predictions to align more closely with actual values, thereby enhancing forecasting accuracy. The mathematical formulation of the EP loss function is presented below:
Here, $T$ represents the threshold, the two penalization factors are the underestimation factor and the overestimation factor, and $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition is true, 0 otherwise), used to activate penalty terms under specific conditions. The hyperparameters associated with the EP loss function were tuned using a grid search. This systematic exploration of the search space allowed us to identify the configuration that produced the best predictive performance while maintaining stability across multiple experiments.
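As a rough illustration of the conditional structure described above, the following sketch shows one hypothetical threshold-gated asymmetric loss. It is not the EP formulation itself: the squared-error base term, the choice to gate the penalties on the actual value exceeding the threshold, the squared form of the extra penalties, and the names `ep_like_loss`, `threshold`, `under_factor`, and `over_factor` are all assumptions made for this example.

```python
def ep_like_loss(y_true, y_pred, threshold, under_factor, over_factor):
    """Hypothetical peak-aware loss (illustrative only, not the published EP loss).
    Base squared error everywhere, plus direction-specific penalties
    that switch on when the actual value lies above the threshold."""
    total = 0.0
    for y, Y in zip(y_true, y_pred):
        err = y - Y
        loss = err * err  # assumed base term
        if y > threshold:  # assumed gate: actual value is in the peak region
            if Y < y:
                loss += under_factor * err * err  # underestimation penalty
            elif Y > y:
                loss += over_factor * err * err   # overestimation penalty
        total += loss
    return total / len(y_true)
```

With both factors set to zero the sketch reduces to plain MSE, which makes it easy to see how much of the loss comes from the peak-specific terms when the factors are explored in a grid search.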
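To evaluate the effectiveness of the custom loss functions presented above, we employ MSE and MAE as baselines. The model’s performance is compared using both these baseline loss functions and the custom loss functions. The mathematical formulations of MSE and MAE are given below:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - Y_i\right)^2$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|y_i - Y_i\right|$$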
2.4. Evaluation Metrics
To evaluate the performance of the model, the coefficient of determination ($R^2$ score) was selected as the primary metric. In general, the $R^2$ score measures how well the model’s predictions capture the variance in the actual values. It typically ranges from 0 to 1, with higher values indicating better predictive performance, although it can become negative when a model fits worse than predicting the mean. In addition to the $R^2$ score, the RMSE and Peak Recall [28] were also employed to assess the performance of the different loss functions. RMSE measures the average magnitude of the prediction error; it is sensitive to outliers and is therefore particularly useful when large deviations are undesirable. Smaller RMSE values indicate more accurate predictions. Peak Recall quantifies how effectively the model captures peak events above a specified threshold and measures the proportion of true peaks that the model successfully predicts. Moreover, the MAPE was employed to evaluate the model’s performance on the financial dataset. MAPE quantifies prediction accuracy as a percentage of the actual values, making it scale-independent and easily interpretable across different datasets. Recent studies have highlighted the importance of reporting uncertainty when evaluating machine learning models, as results from a single run may be sensitive to randomness in initialization, data shuffling, or training dynamics. Following the guidance of Rainio et al. [29], each experiment in this work was repeated 8–10 times with different random seeds, and we report the mean performance along with its standard deviation. This practice provides a more reliable representation of model stability and allows for statistically meaningful comparison across different loss functions. The mathematical formulations of the $R^2$ score, MAPE, RMSE, and Peak Recall are provided below:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - Y_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}$$

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left|\frac{y_i - Y_i}{y_i}\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - Y_i\right)^2}$$

$$\mathrm{Peak\ Recall} = \frac{\left|\{\, i : y_i > \tau \ \text{and}\ Y_i > \tau \,\}\right|}{\left|\{\, i : y_i > \tau \,\}\right|}$$

where $y_i$ and $Y_i$ are the actual and predicted values, $\bar{y}$ is the mean of the actual values, and $N$ is the number of samples.
Herein, $\tau$ denotes the threshold used for calculating Peak Recall. Because the objective of this study is to evaluate model accuracy specifically on peak events, we set $\tau$ to the 90th percentile of the target values across all experiments. This allows us to focus on the highest 10% of the target distribution, which corresponds to the peak-value regime of greatest interest. Using a fixed and consistent $\tau$ ensures fair and comparable evaluation of peak-event performance among all models. In this study, model training and testing were carried out using Python 3.9.21. The experiments were conducted on a machine equipped with an Intel Core i7-6700HQ processor, an NVIDIA GeForce GTX 960M graphics card, and 16 GB RAM.
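The evaluation metrics described in this section can be sketched with the Python standard library as follows. The $R^2$, RMSE, and MAPE functions follow their standard definitions. The `peak_recall` function is an assumption based on the verbal description: a true peak counts as captured when the prediction also exceeds the threshold. The linearly interpolated `percentile` helper and the `summarize` helper for seed-wise mean and standard deviation are likewise illustrative.

```python
import math
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no zero targets."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def percentile(values, pct):
    """Linearly interpolated percentile, pct in [0, 100]."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def peak_recall(y_true, y_pred, tau):
    """Share of true peaks (y > tau) whose prediction also exceeds tau (assumed criterion)."""
    peaks = [(y, p) for y, p in zip(y_true, y_pred) if y > tau]
    if not peaks:
        return float("nan")
    return sum(1 for _, p in peaks if p > tau) / len(peaks)


def summarize(scores):
    """Mean and sample standard deviation of a metric across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)
```

In use, the threshold would be computed once from the targets, for example `tau = percentile(y_true, 90)`, and each metric would be collected over the repeated runs before being passed to `summarize`.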
4. Conclusions and Outlooks
In this study, time series datasets with high variability were investigated. A new custom loss function, the Enhanced Peak (EP) loss function, was proposed to improve the prediction of peak values. By using two independent factors for underestimations and overestimations, EP can penalize the two types of deviation from the actual values separately. To evaluate this loss function, the results from MSE and MAE were chosen as the baseline, and EP’s performance was compared to the pinball loss function, which can reduce underestimation or overestimation by adjusting its quantile value.
To demonstrate the advantage of EP, its performance was measured over various datasets, including the NOx emission dataset, the streamflow dataset, and the gold price financial dataset. This collection of datasets was selected to span multiple application domains, namely environmental science, hydrology, and finance, in order to evaluate EP in diverse time series settings. These datasets are considered highly variable, meaning that the data are significantly volatile over time steps. This condition makes it demanding for models to capture temporal dependencies effectively. Our purpose was to introduce the EP loss function to overcome this issue. The EP loss function penalizes underestimation and overestimation once a specific threshold is passed. This promotes the model’s ability to capture temporal dependencies over time steps, resulting in a more robust forecast.
EP performed better than the MSE, pinball, and Huber loss functions in all instances except one, in which the Huber loss performed slightly better; the difference in that case was negligible. One reason for this is that the pinball and Huber losses each have only one hyperparameter, which limits their capability to treat underestimations and overestimations differently over peak values. EP, on the other hand, uses a threshold and two separate terms for underestimation and overestimation, and can therefore adapt better to the data. The prediction results demonstrate the advantage of EP over the other loss functions in these experiments. The ability of EP to reduce error accumulation over time steps, while being tested on datasets from various domains, underscores its utility for real-world forecasting tasks, such as financial market predictions or energy demand modeling. This highlights its potential for broader adoption in domains requiring high temporal accuracy.
In spite of EP’s capability to adapt to sharp peak variations, one of its limitations is the wide range of values that can be chosen for the hyperparameters. Since there is no predefined range of values for the underestimation and overestimation factors, the model must be trained with various sets of hyperparameters in order to achieve the best results, which can be time-consuming in some cases.
Our future work involves validating the EP loss function’s applicability over more diverse datasets and enhancing its predictive ability not only on peak values but also across the entire dataset, in order to improve overall forecasting performance.