1. Introduction
Ambient temperature is influenced by factors such as humidity, pressure, wind speed, and the amount of solar radiation reaching the surface. Nowadays, highly accurate meteorological forecasts can be produced using various features [1,2]. Despite these highly accurate predictions, further improvements in accuracy are necessary, particularly for applications in agriculture, environmental monitoring, and public health. The objective of this study is to develop and compare machine learning models and signal processing methods for temperature prediction using fundamental meteorological parameters such as temperature, humidity, and pressure. Specifically, this study assesses the effectiveness of two deep learning architectures, Long Short-Term Memory (LSTM) and Transformer models, as well as the Seasonal AutoRegressive Integrated Moving Average with eXogenous variables (SARIMAX) model from statistical signal processing, in capturing temperature variations over time [3].
Temperature prediction is widely used in meteorological forecasting and environmental monitoring, and machine learning, deep learning, and signal processing methods have contributed to more precise weather and ambient temperature predictions [4,5]. Due to the adverse effects of global warming, daily and hourly temperature predictions have become particularly important in agriculture, ecological and biological systems, and the health sector, especially for uses in greenhouses, crop management, and disaster planning [6,7,8].
As is known, factors such as temperature, pressure, and humidity, which affect weather conditions and ambient temperature, can be treated as time series data. Therefore, to predict temperature, it is necessary to learn temperature patterns by considering the relationship of the data with humidity and pressure [9]. Many algorithms are available to learn data patterns in greenhouse and ambient temperature applications, such as Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), and SARIMAX [3,10]. Transformers, the LSTM architecture, and SARIMAX have gained widespread adoption in time series forecasting, sequential data modeling, and seasonality analysis, outperforming traditional machine learning models such as SVM, RF, and ANN. LSTM and Transformer models are specifically designed to handle sequential data, making them highly effective for time series prediction and other tasks requiring temporal awareness. Unlike SVM and Random Forest, which treat data as independent instances, LSTM utilizes gated memory mechanisms to retain past information, while the Transformer architecture employs self-attention mechanisms to capture long-range dependencies [11]. ANN, though capable of learning complex patterns, struggles with sequential dependencies due to its lack of an inherent memory mechanism [12]. SARIMAX, in contrast, is an example of a statistical signal processing method. It models the time series using linear relationships among observations and incorporates trend and seasonal structures through differencing and seasonal autoregressive components [13].
Temperature forecasting is a complex task that necessitates accurately capturing temporal dependencies, nonlinear patterns, and complex correlations present in time series data. Traditional machine learning algorithms, such as SVM and RF, primarily rely on handcrafted feature extraction and often fail to effectively model long-term dependencies in sequential data due to their inherent limitations in processing temporal relationships. Although these models may perform well in short-term forecasting scenarios, their ability to generalize across extended time horizons remains limited, making them less suitable for applications that require high accuracy over long periods [
10].
In this study, three approaches are used: the LSTM algorithm, which builds on the basic principles of the Recurrent Neural Network (RNN) model; the Transformer architecture, originally developed for Natural Language Processing (NLP); and the Seasonal AutoRegressive Integrated Moving Average with eXogenous variables (SARIMAX) model from statistical signal processing [13,14,15].
Since LSTM and Transformer models contain hidden layers and hidden dimensions, activation functions are used to increase nonlinearity [16,17]. SARIMAX, by contrast, is a linear model with no hidden layers or activation functions; it relies on autoregressive, integrated, and moving average components to capture temporal dependencies [13].
In this study, the LSTM, Transformer, and SARIMAX models are applied to a dataset containing temperature, humidity, pressure, and time features in order to predict temperature. A six-month temperature prediction is then conducted to evaluate the performance of the models. The initial one-week interval is considered a short-term forecast, while the complete six-month prediction is treated as a long-term forecast. This distinction allows for a more comprehensive assessment of each model's ability to capture both immediate fluctuations and extended temporal patterns. Each model is examined to determine in which period it exhibits the highest error and in which period it yields the most favorable results. The goal is to predict the rising and falling temperatures around sunrise and sunset with as little error as possible; therefore, the architectures used aim to increase sensitivity to temperature fluctuations throughout the day by increasing nonlinearity. The data were taken from an official, licensed Kaggle dataset of hourly measurements, including temperature, pressure, humidity, and wind speed, collected in Hungary between 2006 and 2016. Although the dataset also includes wind speed, visibility, and wind bearing, this first stage considers only the basic physical quantities of time, temperature, humidity, and pressure in order to assess their compatibility with the algorithms.
2. Materials and Methods
2.1. Data Preprocessing
The data used in this study were obtained by sampling temperature, humidity, and pressure sensors every hour over a period of 10 years and 5 months, between 2006 and 2016, in Hungary. The dataset contains positive and negative floating-point values that span different ranges; if left unscaled, these differing magnitudes can increase model errors, so the values must be brought into a common range. In total, there are 96,453 records in the data, with no missing or incorrect values. In this study, time, temperature, humidity, and pressure are considered, while the remaining features are excluded. The unused features were removed in a data preprocessing step implemented in Python (v3.8.6). We utilized the Weather in Szeged 2006–2016 dataset, which is publicly available at https://www.kaggle.com/datasets/budincsevity/szeged-weather/data (accessed on 1 April 2024).
In machine learning, data preprocessing is a critical step that transforms raw data into a format suitable for modeling. Proper preprocessing ensures that the data are clean and consistent, ultimately improving model performance, particularly in time series forecasting applications. Prior to model implementation, the dataset is sorted chronologically to ensure temporal consistency, as the raw data exhibited an irregular date sequence. The dataset is then partitioned into training and testing subsets, with 95% of the data allocated for training and the remaining 5% reserved for testing; the test data cover approximately the last six months of the raw data. The most frequently used preparation step is normalization or standardization. Normalization scales numerical features to a standardized range, typically [0, 1] or [−1, 1]. In this study, a [0, 1] transformation is used, following the normalization defined in Equation (1). This transformation is essential for improving the efficiency and stability of deep learning models, including Long Short-Term Memory (LSTM) networks and Transformer architectures. Equation (1) shows the parameters [18].
In the normalization process given in Equation (1), µ(x) represents the mean of feature x, and σ(x) represents the standard deviation of feature x [19]. Temperature, pressure, and humidity are the features, and normalization is applied to each feature separately.
The normalization process is generally not required in the SARIMAX model because SARIMAX is a linear model, and the structure of this model works directly with statistical coefficient estimation and focuses on modeling the structural dependencies of the time series (e.g., autoregressive relationships, seasonality, and error terms) rather than the absolute scales of the variables. In this study, normalization is not applied to the SARIMAX model [
3].
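To make the preprocessing pipeline concrete, the following minimal sketch performs the feature selection, chronological sorting, 95/5 split, and per-feature scaling described above. The file path and column names are assumptions based on the public Kaggle CSV, and scikit-learn's MinMaxScaler is used here only as an illustrative stand-in for the normalization of Equation (1); the scaled data feed the LSTM and Transformer models, while SARIMAX works on the unscaled series.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Weather in Szeged 2006-2016 dataset (file name and columns assumed from Kaggle).
df = pd.read_csv("weatherHistory.csv")
df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], utc=True)

# Keep only the features considered in this study and sort chronologically.
df = df[["Formatted Date", "Temperature (C)", "Humidity", "Pressure (millibars)"]]
df = df.sort_values("Formatted Date").reset_index(drop=True)

# 95/5 chronological train-test split (the last ~6 months form the test set).
split = int(len(df) * 0.95)
train, test = df.iloc[:split], df.iloc[split:]

# Scale each feature to [0, 1] for LSTM/Transformer; the scaler is fit on training data only.
features = ["Temperature (C)", "Humidity", "Pressure (millibars)"]
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train[features])
test_scaled = scaler.transform(test[features])
# Note: SARIMAX is fit on the unscaled train/test values, as normalization is not applied to it.
```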
2.2. Long Short-Term Memory
The LSTM architecture (see Figure 1) generally consists of three different gates: the input gate, the output gate, and the forget gate. LSTM is commonly adapted to time series algorithms. The input gate controls the extent to which new information should be written into the cell state. Like the forget gate, it uses an activation function to decide which values to update. Additionally, the candidate value, which could potentially be added to the cell state, is generated using a tanh activation function [20]. The candidate value (C̃(t)) is given in Equation (2).
The function of the input gate is shown in Equation (3).
Here, i(t) denotes the input gate; W_i and R_i represent the weights and b_i the bias, respectively. σ refers to the sigmoid activation function. These weights can be initialized at zero and are updated over the training epochs. Equation (4) states the values assigned to the W, R, and b terms.
The forget gate is responsible for determining which information from the previous cell state can be discarded. It takes the current input (x(t)) and the previous hidden state (h(t−1)) as inputs, passes them through a sigmoid activation function, and produces a value between 0 and 1 for each number in the previous cell state (C(t−1)) [22]. A value close to 0 indicates that the information can be mostly forgotten, while a value close to 1 indicates that it can be retained. The cell state (C(t)) is as shown in Equation (5).
The cell state is updated by combining the effects of the forget gate and the input gate. The previous cell state (C(t−1)) is multiplied by the forget gate value (f(t)), effectively forgetting parts of the state, and then the result is added to the product of the input gate value (i(t)) and the candidate value (C̃(t)).
As seen in Equation (3), an activation function must be chosen, and various types exist, such as the Sigmoid, Tanh, ReLU, and ELU functions [23]. For this research, sigmoid is selected as the activation function since its smooth, differentiable form suits the data. Equation (5) shows the equality related to the sigmoid function. Another activation function is also indicated by the following:
As nonlinearity increases in activation functions, a decrease in loss values during the training phase is observed [6]. Increasing the nonlinearity in the LSTM structure is vital for estimating time series with random values, since greater nonlinearity provides better estimation. Therefore, the sigmoid function is utilized in all three gates. Below, Equations (6) and (7) present the forget gate and output gate, respectively.
The output gate determines the output of the current LSTM cell, which is based on the updated cell state. The output gate uses the current input (x(t)) and the previous hidden state (h(t−1)) to decide what information from the cell state will be output. A sigmoid activation function is applied to a linear combination of these inputs:
In Equation (8), the final output equation of the LSTM architecture is expressed as follows:
Figure 2a shows that every LSTM unit is composed of LSTM blocks, and Figure 2b shows that every LSTM unit receives the current input (x(t)), the previous hidden state (h(t−1)), and the previous cell state (C(t−1)); these parameters enter the LSTM unit and generate the current hidden state (h(t)) and current cell state (C(t)), which are passed to the next LSTM unit.
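To make the gate equations concrete, the sketch below implements a single LSTM cell step in NumPy, following the standard formulation summarized above (sigmoid gates, a tanh candidate, and the additive cell-state update); the equation numbers in the comments follow this section's references. Weight shapes and initialization are illustrative assumptions, not the trained parameters used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step. W: input weights (4*hidden, n_features),
    R: recurrent weights (4*hidden, hidden), b: biases (4*hidden,).
    Rows are stacked as [input gate, forget gate, candidate, output gate]."""
    hidden = h_prev.shape[0]
    z = W @ x_t + R @ h_prev + b
    i_t = sigmoid(z[0:hidden])               # input gate i(t), Eq. (3)
    f_t = sigmoid(z[hidden:2 * hidden])      # forget gate f(t), Eq. (6)
    g_t = np.tanh(z[2 * hidden:3 * hidden])  # candidate value C~(t), Eq. (2)
    o_t = sigmoid(z[3 * hidden:4 * hidden])  # output gate o(t), Eq. (7)
    c_t = f_t * c_prev + i_t * g_t           # cell-state update C(t), Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state / output h(t), Eq. (8)
    return h_t, c_t

# Toy usage: 3 input features (temperature, humidity, pressure), hidden size 4.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 3))
R = rng.normal(size=(16, 4))
b = np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=3), h, c, W, R, b)
```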
2.3. Transformers
The Transformer architecture was initially developed for NLP problems, but it has also been integrated into and applied to time series problems. Since the problem at hand is a time series problem, the Transformer architecture is also utilized in this application. The Transformer architecture aims to emphasize which information in the data is important and which part of the data should be attended to. In doing so, it examines the relationships among the inputs. Therefore, the attention mechanism is a key feature of the Transformer architecture.
The Transformer architecture consists of encoder and decoder blocks, within which the attention mechanism is formed. As shown in Figure 3, the attention mechanism itself consists of sub-blocks, and the number N in Figure 3 indicates the number of stacked Transformer layers [24].
The most crucial part of the Transformer architecture is the self-attention mechanism. This mechanism is represented by three important vectors on the input side: the query (Q), key (K), and value (V) vectors. Figure 3 illustrates how the K, V, and Q vectors are incorporated into the attention mechanism. Here, the input vectors are multiplied by weight matrices, set to initial values and learned during training, to produce the K, V, and Q vectors. The attention score is obtained by the dot product of the Q and K vectors, indicating how similar the Q and K vectors are. In this study, considering temperature, humidity, and pressure values, there are three distinct sets of K, V, and Q vectors [25].
The scaling factor in Equation (10) is based on the dimension d. In the block diagram shown in Figure 3, the outputs of the attention mechanism, as given in Equation (11), are combined and linearly projected into a single matrix.
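As an illustration of the attention computation in Equations (10) and (11), the following NumPy sketch computes scaled dot-product attention for a toy sequence; the projection matrices, sequence length, and dimensions are arbitrary assumptions rather than the configuration used in this study.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of values

# Toy example: 24 hourly time steps, 3 input features, projection dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))                  # e.g., temperature, humidity, pressure
Wq, Wk, Wv = (rng.normal(size=(3, 8)) for _ in range(3))
attended = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)  # shape (24, 8)
```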
Before delving into the input and output attention mechanisms, there is a structure called positional encoding, which provides information about the order and position of input or output vectors. In this study, as the temperature is a time series, knowing the sequence of humidity and pressure data relative to temperature over time can produce effective results in temperature prediction [
26].
In Equations (12) and (13), the pos term represents the position, and the
i term represents the dimension. The reason for choosing sin and cos functions is to facilitate the learning process [
27].
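The sinusoidal positional encoding of Equations (12) and (13) can be sketched as follows; the sequence length and model dimension below are placeholder values chosen only for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]       # dimension index
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=24, d_model=8)  # one encoding vector per hourly step
```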
The O(n²) expression is written in Big-O notation, a standard way in computer science to describe the computational or memory complexity of an algorithm in relation to the input size. Here, n denotes the length of the input sequence. A complexity of O(n²) indicates that the computational time or memory usage grows quadratically with the input length. In the context of attention mechanisms, this reflects the fact that each element in the sequence must attend to every other element, resulting in n × n comparisons [28].
Another structure in the Transformer architecture is the feed-forward network. One of the main features of the feed-forward structure is to learn the input–output mappings of a sequence that characterizes a real system. In this study, a multi-layer perceptron (MLP) structure is used as the feed-forward network. The preference for an MLP stems from the nature of the data as a time series [18]. The MLP in a Transformer consists of two fully connected layers with a nonlinear activation function in between. As indicated in the LSTM section, activation functions provide differentiable nonlinearity, leading to fewer errors in the estimation phase [4]. The MLP is a crucial component of the architecture that is applied independently to each position of the sequence. The MLP is utilized after the attention mechanism and is responsible for transforming and processing the data before they are passed on to the next layer. The use of the MLP in Transformers helps in capturing complex features and patterns in the data that are not easily discernible by the attention mechanism alone. It provides the model with the capacity to model nonlinear dependencies, which is significant for understanding and generating complex sequences such as language, images, or time series data.
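A position-wise feed-forward (MLP) block of the kind described here can be sketched as two dense layers with a nonlinearity in between, applied independently at each time step. The dimensions below and the choice of ReLU as the inner activation are illustrative assumptions, not the 256-unit MLP configuration reported in the Results section.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise MLP: two dense layers with a ReLU nonlinearity in between,
    applied independently to every position (time step) of the sequence."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # nonlinear activation between the two layers
    return hidden @ W2 + b2

# Toy example: 24 time steps, model dimension 8, inner MLP dimension 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
out = feed_forward(X, W1, b1, W2, b2)        # shape (24, 8), same as the input
```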
As seen in Figure 4, the only difference from the Transformer architecture in Figure 3 is the use of the MLP in the feed-forward block.
2.4. Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX)
One of the prominent forecasting methods for time series data that exhibit seasonality is the SARIMA (Seasonal Autoregressive Integrated Moving Average) model. The SARIMA model relies on the series being, or being made, stationary, and this characteristic enables it to produce stable and reliable forecasts for future time periods [29]. Seasonal time series exhibit fluctuations that vary over time; temperature data, in particular, often reveal periodic patterns based on daily, monthly, or yearly cycles. Since the temperature values in the dataset display strong seasonality and periodicity, the SARIMA model is employed to eliminate the adverse effects of these seasonal components across the entire dataset during the forecasting process. As a result, the SARIMA model can generate robust insights when dealing with seasonal and periodic time series data.
The SARIMAX model is derived by extending the SARIMA model with the inclusion of additional explanatory variables. SARIMAX is an advanced signal processing and statistical modeling technique that possesses both linear and polynomial characteristics. Its linearity is inherited from the ARIMA component, while its seasonality and polynomial behavior are captured by the SARIMA structure. Equation (14) below presents the formulation of the SARIMAX model, which effectively integrates both the ARIMA and SARIMA frameworks [
3]. The parameters of Equation (14) are represented in
Table 1.
As illustrated in Equation (14), seasonal and periodic parameters have been integrated into the model to ensure that the effects of seasonality and periodicity are appropriately captured across the entire dataset.
Figure 5 illustrates that the temperature values have seasonality over ten years. Specifically, while the ARIMA component in Equation (14) is able to model short-term trends and fluctuations on an hourly or daily basis, the SARIMAX model focuses on capturing the broader seasonal patterns and long-term trends present in the temperature data. By incorporating additional explanatory features, the SARIMAX model, which is composed of the SARIMA and ARIMA models, is capable of learning both short-term and long-term dynamics, offering a more comprehensive understanding of the underlying temporal behavior [
4].
Equation (15) shows the compact form of Equation (14) and indicates that SARIMAX models consist of seasonal and non-seasonal components and parameters. The non-seasonal part of Equation (15) corresponds to ARIMA; adding the seasonal terms turns it into SARIMA, which is able to offer seasonal insights.
ARIMA is a statistical model used to predict future values; it relies on autoregressive terms to capture the predictions. While exponential smoothing approaches are built around the seasonality captured in the data, the ARIMA model describes a linear autoregressive moving average structure for statistical prediction. However, the ARIMA model faces significant challenges in long-term estimation because it has no seasonal order selection over time, and so it struggles to make strong estimations for data that include seasonality and trends. Since the temperature values studied in this work include seasonality and trends, the ARIMA model alone is not sufficient; it must be extended to the SARIMAX model to add the seasonality and trend perspectives [4].
Although the SARIMAX model does not involve a gradient-based optimization process like LSTM or Transformer, its fitting procedure can still be computationally intensive. This is because SARIMAX parameter estimation relies on iterative maximum likelihood optimization, which requires the repeated evaluation of the likelihood function for different combinations of autoregressive, differencing, moving average, and seasonal parameters [
30].
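A minimal statsmodels sketch of the SARIMAX fitting procedure described above is given below. The non-seasonal and seasonal orders match the values reported in the Results section, but the variable names and the exogenous-feature layout (humidity and pressure as regressors, with `train` and `test` taken from the preprocessing sketch) are assumptions for illustration.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# 'train' and 'test' are assumed to be the chronologically ordered DataFrames from the
# preprocessing sketch; SARIMAX is fit on the unscaled values.
endog = train["Temperature (C)"]
exog = train[["Humidity", "Pressure (millibars)"]]

# Non-seasonal order (p, d, q) = (6, 1, 1) and seasonal order (P, D, Q, s) = (1, 1, 1, 24),
# i.e., a 24 h daily cycle, as reported in the Results section.
model = SARIMAX(
    endog,
    exog=exog,
    order=(6, 1, 1),
    seasonal_order=(1, 1, 1, 24),
    enforce_stationarity=True,
    enforce_invertibility=True,
)
result = model.fit(disp=False)  # iterative maximum likelihood estimation

# Multi-step forecast over the test horizon, supplying the known exogenous features.
forecast = result.forecast(
    steps=len(test), exog=test[["Humidity", "Pressure (millibars)"]]
)
```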
2.5. Metrics
In machine learning regression tasks, the preferred performance metrics are the mean absolute error (MAE), the mean squared error (MSE), and the coefficient of determination (R2) score. Their popularity comes from their ability to effectively quantify prediction errors, measure model accuracy, and compare different forecasting approaches. Since LSTM and SARIMAX capture short-term dependencies well, MAE helps evaluate how closely the models follow actual trends. LSTMs benefit from MSE because it helps in learning stable patterns over sequences, preventing the model from making drastic prediction errors. For Transformers, which focus on long-range dependencies, MAE highlights how much overall deviation exists across time steps, and MSE is leveraged to refine long-range forecasting, ensuring that high-variance sequences remain stable. During the training phase, MSE is used to calculate error values; during the testing phase, MAE is employed to indicate the error between the actual and predicted temperatures. The MAE, MSE, and R2 scores are not used during the training of SARIMAX.
In this study, the selection of optimal hyperparameters for the LSTM, Transformer, and SARIMAX models is carried out through an iterative search process based on the performance metrics (MAE, MSE, and R2) on the validation set. While this study reports only the final selected hyperparameters, we acknowledge that presenting the performance variation across a range of values for key parameters (e.g., learning rate, hidden layer size, and number of epochs for LSTM/Transformer; p, d, q, P, D, Q, and s for SARIMAX) would provide deeper insights into model robustness and the extent of the tuning effort required.
The
R2 score, also known as the coefficient of determination, indicates how well the predicted values of a model capture the variance in the actual data. An
R2 value of 1 means the model perfectly explains all the variability in the target variable, representing ideal predictive performance. An
R2 of 0 implies that the model fails to explain any variance and performs no better than simply predicting the mean of the observed data [
31]. If the
R2 score is negative, it suggests that the model performs worse than a naive mean predictor. In general, a desirable
R2 score ranges between 0 and 1, with values closer to 1 indicating stronger explanatory power and more reliable predictions [
15].
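For reference, the sketch below computes the three metrics with scikit-learn on a small example; the arrays are placeholders standing in for the actual and predicted temperature series.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder arrays standing in for observed and predicted temperatures (in °C).
y_true = np.array([10.2, 11.5, 13.1, 14.0, 12.8])
y_pred = np.array([10.0, 11.9, 12.7, 14.4, 12.5])

mae = mean_absolute_error(y_true, y_pred)   # mean |y_i - y_hat_i|
mse = mean_squared_error(y_true, y_pred)    # mean (y_i - y_hat_i)^2
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE = {mae:.3f} °C, MSE = {mse:.3f}, R2 = {r2:.3f}")
```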
3. Results
The temperature predictions are obtained using the Transformer, LSTM, and SARIMAX models.
Table 2 presents the best results achieved in the short term by these three models, while
Table 3 shows the best results in the long term by the three models. The LSTM model uses a learning rate of 0.0001, 2 layers, a hidden size of 128, a batch size of 128, the Adam optimizer, and a look-back window of 24 samples; it takes about 20 min to train with these parameters. The Transformer model achieved its best results with the following parameters: 16 attention layers, 4 Transformer layers, an MLP size of 256, a learning rate of 0.00001, and a batch size of 64; it takes about 45 min to train with these parameters. The SARIMAX model achieved its best results with the following parameters: an autoregressive order of 6, a degree of differencing of 1, a moving average order of 1, a seasonal autoregressive order of 1, a seasonal differencing of 1, a seasonal moving average order of 1, and a seasonal period of 24, with the enforce-stationarity and enforce-invertibility options enabled; it takes about 1 h to train with these parameters and conditions. The results are shown in
Table 2 according to the
MAE,
MSE, and
R2 score metrics. Based on
Table 2 and
Table 3, LSTM makes the best estimation compared to SARIMAX and Transformer for the short term, but the estimations of LSTM and SARIMAX are close to each other for the short term. Although SARIMAX performs well in short-term forecasts by capturing recent autoregressive, moving average, and seasonal patterns, its errors grow quickly in long-term predictions. In multi-step forecasting, each new prediction depends on the previous one, so small errors accumulate and lead to much larger deviations, which significantly increase the
MSE. Furthermore, because SARIMAX is a linear model with fixed parameters, it cannot effectively capture nonlinear patterns, long-range dependencies, or structural changes in the data. As a result, shifts in seasonality, trends, or external factors further reduce its accuracy over longer horizons.
MAE also rises sharply, reflecting the model’s poor ability to keep predictions close to observed values over time. In contrast, the Transformer achieves both low
MSE and
MAE, indicating stable and accurate forecasts. The
R2 score further highlights that SARIMAX produces negative values, meaning the model fits worse than a simple mean predictor. The Transformer maintains a clearly positive
R2, demonstrating that it captures meaningful structure in the data.
LSTM has the lowest
MAE and
MSE and is the only model with a positive
R2 score in the short term. This means that LSTM performs by far the best for short-term forecasting. SARIMAX performs very poorly, especially over the long term. Its
MAE and
MSE are very high, and its
R2 is very negative compared to those of LSTM and the Transformer. The Transformer makes the best estimation compared to SARIMAX and LSTM in the long term because it has the lowest MAE and MSE and the highest R2 score.
In Equations (16)–(18), N represents the number of samples, y_i the actual value, and ŷ_i the predicted temperature value. Additionally, in this study, temperature predictions for the upcoming week and for six months ahead are conducted, and the errors for the following week and the next six months are calculated in terms of the MAE, MSE, and R2 scores.
The one-week and six-month forecasts on the obtained test data were also compared for the three models.
As shown in Figure 6, based on the graphs for the one-week temperature forecast, all three models capture the pattern of the actual temperature values. However, as shown in Figure 7, based on the graphs for the six-month temperature forecast, it is mostly the Transformer that captures the pattern of the actual temperature values; the SARIMAX and LSTM models do not capture the pattern in the long term. Although the LSTM model, like SARIMAX, does not capture the long-term pattern, it achieves temperature prediction with lower error than the SARIMAX model. LSTM networks are specifically designed for sequential data processing, utilizing memory cells to capture short-term and long-term dependencies. Temperature patterns exhibit recurring trends, which LSTM can efficiently learn over time. LSTM tends to generalize better than Transformer models, which typically require large amounts of data to effectively learn complex temporal relationships. In contrast, SARIMAX models offer a statistical approach that can handle both trend and seasonality components explicitly, and they perform well on time series with strong autocorrelation and when exogenous variables are available. Transformers, meanwhile, rely on self-attention mechanisms, which may struggle to capture local dependencies as effectively as LSTM on smaller datasets. Similarly, SARIMAX models, while effective for capturing linear trends and seasonal patterns, may fall short in modeling the complex nonlinear dependencies inherent in temperature time series.
Figure 7,
Figure 8,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 show a two-dimensional histogram of errors made by the LSTM, Transformer, and SARIMAX models.
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 depict the distribution of errors obtained for one-week and six-month temperature predictions divided into hours in a two-dimensional histogram. The y-axis shows the absolute error value, and the x-axis expresses the 24 h period of a day. For example, the sum of errors for one week in the LSTM model, as shown in
Figure 8, is about 3.6 °C in terms of absolute error at 10:00 AM, whereas the sum of errors for six months in the Transformer model, as shown in Figure 11, is about 2.98 °C in terms of absolute error at 10:00 AM. Similarly, the sum of errors for one week in SARIMAX, as shown in Figure 12, is 2.97 °C in terms of absolute error at 10:00 AM.
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 show at which times of the day the algorithms make more errors so that a solution can be established in order to decrease the errors. Based on the results shown in
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13, it can be clearly seen that errors in temperature prediction are more pronounced during time periods close to sunrise, shortly after sunrise, occasionally at noon, and around sunset. This also indicates that models must be able to sense rapid changes in temperature during sunrise and sunset. The LSTM model makes its biggest errors during sunrise in the short term, and during sunset and at noon in the long term. The Transformer model makes its biggest errors at night and during sunrise in both the short term and the long term. The SARIMAX model makes its biggest errors at noon and during sunset in the short term, and mostly during sunrise in the long term.
Previous studies and this work show that smaller datasets and fewer features lead to better results for SARIMAX and LSTM in the short term, whereas larger datasets and more features provide better outcomes for the Transformer in the long term, which is compatible with greater model complexity and larger data sizes. Accordingly, the Transformer model tends to produce smoother predictions, as shown in
Figure 6 and
Figure 10, failing to capture the sharp fluctuations present in the actual temperature at night in the short term. This result is likely due to the Transformer’s reliance on self-attention mechanisms, which tend to average out rapid changes and require larger datasets to effectively learn fine-grained temporal variations. When it comes to training time, the LSTM and Transformer models demonstrate greater efficiency compared to the SARIMAX model, based on the performance outcomes achieved. This suggests that deep learning-based approaches not only offer competitive predictive capabilities but also require less computational time during model training, making them more suitable for time-sensitive applications. The SARIMAX and LSTM models seem to produce better predictions, as seen in
Figure 6,
Figure 8, and
Figure 12, for the short term. This result is likely because SARIMAX explicitly models seasonality and trend components using statistical formulations, while LSTM captures short-term dependencies more effectively through its memory cell structure, making both models well-suited for learning rapid fluctuations in temperature data. On the other hand, the Transformer model shows the best prediction in
Figure 7 and
Figure 11 for the long term. This result could be due to the model’s attention mechanism, which focuses on long-range dependencies but may overlook subtle, short-term variations that are crucial for accurate temperature forecasting. The SARIMAX and LSTM models do not make great predictions in
Figure 7,
Figure 9 and
Figure 13 for the long term. These results show that SARIMAX is limited in capturing complex nonlinear dynamics beyond seasonal and trend components, and LSTM tends to lose effectiveness over longer time horizons due to its vanishing gradients and limited memory capacity.
4. Discussion
The obtained results indicate that the LSTM and SARIMAX models successfully capture the pattern of real temperatures in the short term, whereas the Transformer makes better predictions in the long term. Particularly in the short term, the LSTM model predicts the real temperature values with fewer errors compared to the Transformer and SARIMAX models, indicating its more effective and robust prediction capability. Similarly, the SARIMAX model also demonstrates a strong short-term performance by explicitly modeling seasonality and trend components. This overall success relies on LSTM’s ability to learn from complex nonlinear patterns and SARIMAX’s strength in capturing structured temporal dependencies in time series data. The Transformer model shows a superior performance in long-term predictions, thanks to its self-attention mechanism, which effectively captures global temporal dependencies and scales well in the long term.
Although this study primarily concentrates on comparing the forecasting performance of the examined models, it is equally important to explore the underlying mechanisms through which these models achieve their results—particularly to understand the factors contributing to the Transformer’s superior long-term forecasting capability. Recent research has introduced interpretability approaches, including attention weight visualization, saliency maps, and SHAP (SHapley Additive exPlanations), which offer means to identify the temporal patterns, lag dependencies, and frequency components that are most influential in the prediction process. Incorporating such techniques in future analyses could provide a more transparent understanding of the learned representations, thereby strengthening the connection between model behavior and the underlying time series signal characteristics.
Although the individual training durations of the models were relatively short, the overall experimentation process lasted approximately three weeks. This is primarily due to hardware limitations, as the machine used for training does not have sufficient computational power to efficiently explore different parameter configurations. As a result, obtaining workable and stable model outputs required significantly more time, despite the simplicity of each training iteration.
In future studies, more in-depth analyses will be conducted on the LSTM, SARIMAX, and Transformer models to further enhance their performance. Specifically, efforts will be made to improve the attention mechanism of the Transformer model and reduce the error rates of the model in terms of absolute error. The histograms in the Results section show the biggest error in a day, so attention mechanisms can be adapted accordingly. Additionally, the enhancement of data preprocessing techniques and better adjustment of model parameters (such as the number of layers, hidden layer size, attention mechanism, embedding size, autoregression, moving average, seasonality factors, etc.) for the three models are targeted to achieve more robust and accurate results. This enhancement can be translated to the hybrid model. By combining the strengths of the Transformer, LSTM, and SARIMAX models, it can be used to achieve better temperature prediction.
This study demonstrates the significant success in temperature prediction achieved by the LSTM, SARIMAX, and Transformer models, indicating that these models can provide more effective solutions to time series prediction problems in the future with further development. The success of these models highlights the great potential of time series data analysis and prediction. A hybrid model that combines LSTM, SARIMAX, and the Transformer is also worth evaluating, as these models have complementary strengths for temperature prediction in terms of attention mechanisms, memory, and seasonality.
In addition to the LSTM, SARIMAX, and Transformer models used in this study, future work will explore the implementation of the Facebook Prophet library. Prophet is specifically designed for time series forecasting and is particularly well-suited for data exhibiting strong seasonal trends and holiday effects, making it a promising tool for temporal data analysis. Its intuitive modeling approach and automatic handling of seasonality components enable analysts to generate reliable forecasts even with limited domain-specific tuning. The error metrics obtained from the LSTM, SARIMAX, Transformer, and hybrid approaches and the Facebook Prophet model will be systematically compared using the same dataset to comprehensively evaluate performance. This comparison will help determine the relative strengths and weaknesses of each model, particularly in capturing long-term dependencies, handling nonlinear patterns, and adjusting to seasonal fluctuations. By incorporating Prophet into the experimental framework, this future work will aim to enhance the robustness of these findings and provide broader insights into time series forecasting techniques.
One of the primary limitations of this study is that the dataset used for temperature prediction was collected exclusively from Hungary, covering the period between 2006 and 2016. While the dataset provides a comprehensive and structured representation of temperature variations within the region, its geographic specificity may limit the generalizability of the findings to other climates and geographical conditions [
12]. Temperature prediction models, including LSTM, SARIMAX, and Transformer models, rely on learning temporal dependencies within the data. However, climate patterns, seasonal variations, and atmospheric conditions differ significantly across regions [
32]. For example, temperature fluctuations in Hungary may not accurately represent those in equatorial regions with stable weather patterns or in polar regions with extreme seasonal variations. This could impact the models' ability to generalize to diverse environments.