1. Introduction
Ambient temperature is influenced by factors such as humidity, pressure, wind speed, and the amount of solar radiation reaching the surface. Nowadays, highly accurate meteorological forecasts can be produced using various features [1,2]. Despite these highly accurate predictions, further improvements in accuracy are necessary, particularly for applications in agriculture, environmental monitoring, and public health. The objective of this study is to develop and compare machine learning models and signal processing methods for temperature prediction using fundamental meteorological parameters such as temperature, humidity, and pressure. Specifically, this study assesses the effectiveness of two deep learning architectures, Long Short-Term Memory (LSTM) and Transformer models, as well as the Seasonal AutoRegressive Integrated Moving Average with eXogenous variables (SARIMAX) model from statistical signal processing, in capturing temperature variations over time [3].
Temperature prediction is widely used in meteorological forecasting and environmental monitoring, and machine learning, deep learning, and signal processing methods have contributed to more precise weather and ambient temperature predictions [4,5]. Due to the adverse effects of global warming, daily and hourly temperature predictions have become particularly important in agriculture, ecological and biological systems, and the health sector, especially for uses in greenhouses, crop management, and disaster planning [6,7,8].
As is known, factors such as temperature, pressure, and humidity, which affect weather conditions and ambient temperature, can be treated as time series data. Therefore, to predict temperature, it is necessary to learn temperature patterns by considering the relationship of the data with humidity and pressure [9]. Many algorithms are available to learn data patterns in greenhouse and ambient temperature applications, such as Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), and SARIMAX [3,10]. Transformers, the LSTM architecture, and SARIMAX have gained widespread adoption in time series forecasting, sequential data modeling, and seasonality analysis, outperforming traditional machine learning models such as SVM, RF, and ANN. LSTM and Transformer models are specifically designed to handle sequential data, making them highly effective for time series prediction and other tasks requiring temporal awareness. Unlike SVM and Random Forest, which treat data as independent instances, LSTM utilizes gated memory mechanisms to retain past information, while the Transformer architecture employs self-attention mechanisms to capture long-range dependencies [11]. ANN, though capable of learning complex patterns, struggles with sequential dependencies due to its lack of an inherent memory mechanism [12]. SARIMAX, in contrast, is an example of a statistical signal processing method. It models the time series using linear relationships among observations and incorporates trend and seasonal structures through differencing and seasonal autoregressive components [13].
Temperature forecasting is a complex task that necessitates accurately capturing temporal dependencies, nonlinear patterns, and complex correlations present in time series data. Traditional machine learning algorithms, such as SVM and RF, primarily rely on handcrafted feature extraction and often fail to effectively model long-term dependencies in sequential data due to their inherent limitations in processing temporal relationships. Although these models may perform well in short-term forecasting scenarios, their ability to generalize across extended time horizons remains limited, making them less suitable for applications that require high accuracy over long periods [
10].
In this study, three approaches are used: the LSTM algorithm, which builds on the basic principles of the Recurrent Neural Network (RNN) model; the Transformer architecture, originally developed for Natural Language Processing (NLP); and the Seasonal AutoRegressive Integrated Moving Average with eXogenous variables (SARIMAX) model from statistical signal processing [13,14,15].
Since LSTM and Transformer models contain hidden layers and hidden dimensions, activation functions are used to increase nonlinearity [16,17]. SARIMAX, by contrast, is a linear model with no hidden layers or activation functions; it relies on autoregressive, integrated, and moving average components to capture temporal dependencies [13].
In this study, the LSTM, Transformer, and SARIMAX models are applied to a dataset containing temperature, humidity, pressure, and time features in order to predict temperature. A six-month temperature prediction is then conducted to evaluate the performance of the models. The initial one-week interval is considered a short-term forecast, while the complete six-month prediction is treated as a long-term forecast. This distinction allows for a more comprehensive assessment of each model's ability to capture both immediate fluctuations and extended temporal patterns. Each model is examined to determine in which period it exhibits the highest error and in which period it yields the most favorable results. The goal is to predict the rising and falling temperatures around sunrise and sunset with as little error as possible; therefore, the architectures used aim to increase sensitivity to temperature fluctuations throughout the day by increasing nonlinearity. The data were taken from an official, licensed Kaggle dataset of hourly measurements, including temperature, pressure, humidity, and wind speed, collected in Hungary between 2006 and 2016. Although the dataset also includes wind speed, visibility, and wind bearing, this first stage considers only the basic physical quantities of time, temperature, humidity, and pressure in order to assess their compatibility with the algorithms.
2. Materials and Methods
2.1. Data Preprocessing
The data used in this study were obtained by sampling temperature, humidity, and pressure sensors every hour over a period of 10 years and 5 months, between 2006 and 2016, in Hungary. The dataset contains positive and negative floating-point values that span different ranges; if left unscaled, these differing magnitudes can increase model errors, so the values must be brought into a common range. In total, there are 96,453 records in the data, with no missing or incorrect values. In this study, time, temperature, humidity, and pressure are considered, while the remaining features are excluded. The unused features were removed in a data preprocessing step implemented in Python (v3.8.6). We utilized the Weather in Szeged 2006–2016 dataset, which is publicly available at https://www.kaggle.com/datasets/budincsevity/szeged-weather/data (accessed on 1 April 2024).
In machine learning, data preprocessing is a critical step that transforms raw data into a format suitable for modeling. Proper preprocessing ensures that the data are clean and consistent, ultimately improving model performance, particularly in time series forecasting applications. Prior to model implementation, the dataset is sorted chronologically to ensure temporal consistency, as the raw data exhibited an irregular date sequence. The dataset is then partitioned into training and testing subsets, with 95% of the data allocated for training and the remaining 5% reserved for testing; the test data cover approximately the last six months of the raw data. The most frequently used preparation step is normalization or standardization. Normalization scales numerical features to a standardized range, typically [0, 1] or [−1, 1]. In this study, a [0, 1] transformation is used, following the normalization defined in Equation (1). This transformation is essential for improving the efficiency and stability of deep learning models, including Long Short-Term Memory (LSTM) networks and Transformer architectures. Equation (1) shows the parameters [18].
In the normalization process given in Equation (1), µ(x) represents the mean of feature x, and σ(x) represents the standard deviation of feature x [19]. Temperature, pressure, and humidity are the features, and normalization is applied to each feature separately.
The normalization process is generally not required in the SARIMAX model because SARIMAX is a linear model, and the structure of this model works directly with statistical coefficient estimation and focuses on modeling the structural dependencies of the time series (e.g., autoregressive relationships, seasonality, and error terms) rather than the absolute scales of the variables. In this study, normalization is not applied to the SARIMAX model [
3].
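To make the preprocessing pipeline concrete, the following minimal sketch performs the feature selection, chronological sorting, 95/5 split, and per-feature scaling described above. The file path and column names are assumptions based on the public Kaggle CSV, and scikit-learn's MinMaxScaler is used here only as an illustrative stand-in for the normalization of Equation (1); the scaled data feed the LSTM and Transformer models, while SARIMAX works on the unscaled series.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Weather in Szeged 2006-2016 dataset (file name and columns assumed from Kaggle).
df = pd.read_csv("weatherHistory.csv")
df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], utc=True)

# Keep only the features considered in this study and sort chronologically.
df = df[["Formatted Date", "Temperature (C)", "Humidity", "Pressure (millibars)"]]
df = df.sort_values("Formatted Date").reset_index(drop=True)

# 95/5 chronological train-test split (the last ~6 months form the test set).
split = int(len(df) * 0.95)
train, test = df.iloc[:split], df.iloc[split:]

# Scale each feature to [0, 1] for LSTM/Transformer; the scaler is fit on training data only.
features = ["Temperature (C)", "Humidity", "Pressure (millibars)"]
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train[features])
test_scaled = scaler.transform(test[features])
# Note: SARIMAX is fit on the unscaled train/test values, as normalization is not applied to it.
```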
2.2. Long Short-Term Memory
The LSTM architecture (see Figure 1) generally consists of three different gates: the input gate, the output gate, and the forget gate. LSTM is commonly adapted to time series algorithms. The input gate controls the extent to which new information should be written into the cell state. Like the forget gate, it uses an activation function to decide which values to update. Additionally, the candidate value, which could potentially be added to the cell state, is generated using a tanh activation function [20]. The candidate value (C̃(t)) is given in Equation (2).
The function of the input gate is shown in Equation (3).
Here, i(t) denotes the input gate; W_i and R_i represent the weights and b_i the bias, respectively. σ refers to the sigmoid activation function. These weights can be initialized at zero and are updated over the training epochs. Equation (4) states the values assigned to the W, R, and b terms.
The forget gate is responsible for determining which information from the previous cell state can be discarded. It takes the current input (x(t)) and the previous hidden state (h(t−1)) as inputs, passes them through a sigmoid activation function, and produces a value between 0 and 1 for each number in the previous cell state (C(t−1)) [22]. A value close to 0 indicates that the information can be mostly forgotten, while a value close to 1 indicates that it can be retained. The cell state (C(t)) is as shown in Equation (5).
The cell state is updated by combining the effects of the forget gate and the input gate. The previous cell state (C(t−1)) is multiplied by the forget gate value (f(t)), effectively forgetting parts of the state, and then the result is added to the product of the input gate value (i(t)) and the candidate value (C̃(t)).
As seen in Equation (3), an activation function must be chosen, and various types exist, such as the Sigmoid, Tanh, ReLU, and ELU functions [23]. For this research, sigmoid is selected as the activation function since its smooth, differentiable form suits the data. Equation (5) shows the equality related to the sigmoid function. Another activation function is also indicated by the following:
As nonlinearity increases in activation functions, a decrease in loss values during the training phase is observed [6]. Increasing the nonlinearity in the LSTM structure is vital for estimating time series with random values, since greater nonlinearity provides better estimation. Therefore, the sigmoid function is utilized in all three gates. Below, Equations (6) and (7) present the forget gate and output gate, respectively.
The output gate determines the output of the current LSTM cell, which is based on the updated cell state. The output gate uses the current input (x(t)) and the previous hidden state (h(t−1)) to decide what information from the cell state will be output. A sigmoid activation function is applied to a linear combination of these inputs:
In Equation (8), the final output equation of the LSTM architecture is expressed as follows:
Figure 2a shows that every LSTM unit is composed of LSTM blocks, and Figure 2b shows that every LSTM unit receives the current input (x(t)), the previous hidden state (h(t−1)), and the previous cell state (C(t−1)); these parameters enter the LSTM unit and generate the current hidden state (h(t)) and current cell state (C(t)), which are passed to the next LSTM unit.
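To make the gate equations concrete, the sketch below implements a single LSTM cell step in NumPy, following the standard formulation summarized above (sigmoid gates, a tanh candidate, and the additive cell-state update); the equation numbers in the comments follow this section's references. Weight shapes and initialization are illustrative assumptions, not the trained parameters used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step. W: input weights (4*hidden, n_features),
    R: recurrent weights (4*hidden, hidden), b: biases (4*hidden,).
    Rows are stacked as [input gate, forget gate, candidate, output gate]."""
    hidden = h_prev.shape[0]
    z = W @ x_t + R @ h_prev + b
    i_t = sigmoid(z[0:hidden])               # input gate i(t), Eq. (3)
    f_t = sigmoid(z[hidden:2 * hidden])      # forget gate f(t), Eq. (6)
    g_t = np.tanh(z[2 * hidden:3 * hidden])  # candidate value C~(t), Eq. (2)
    o_t = sigmoid(z[3 * hidden:4 * hidden])  # output gate o(t), Eq. (7)
    c_t = f_t * c_prev + i_t * g_t           # cell-state update C(t), Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state / output h(t), Eq. (8)
    return h_t, c_t

# Toy usage: 3 input features (temperature, humidity, pressure), hidden size 4.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 3))
R = rng.normal(size=(16, 4))
b = np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.normal(size=3), h, c, W, R, b)
```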
2.3. Transformers
The Transformer architecture was initially developed for NLP problems, but it has also been integrated into and applied to time series problems. Since the problem at hand is a time series problem, the Transformer architecture is also utilized in this application. The Transformer architecture aims to emphasize which information in the data is important and which part of the data should be attended to. In doing so, it examines the relationships among the inputs. Therefore, the attention mechanism is a key feature of the Transformer architecture.
The Transformer architecture consists of encoder and decoder blocks, within which the attention mechanism is formed. As shown in Figure 3, the attention mechanism itself consists of sub-blocks, and the number N in Figure 3 indicates the number of stacked Transformer layers [24].
The most crucial part of the Transformer architecture is the self-attention mechanism. This mechanism is represented by three important vectors on the input side: the query (Q), key (K), and value (V) vectors. Figure 3 illustrates how the K, V, and Q vectors are incorporated into the attention mechanism. Here, the input vectors are multiplied by weight matrices, set to initial values and learned during training, to produce the K, V, and Q vectors. The attention score is obtained by the dot product of the Q and K vectors, indicating how similar the Q and K vectors are. In this study, considering temperature, humidity, and pressure values, there are three distinct sets of K, V, and Q vectors [25].
The scaling factor in Equation (10) is based on the dimension d. In the block diagram shown in Figure 3, the outputs of the attention mechanism, as given in Equation (11), are combined and linearly projected into a single matrix.
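As an illustration of the attention computation in Equations (10) and (11), the following NumPy sketch computes scaled dot-product attention for a toy sequence; the projection matrices, sequence length, and dimensions are arbitrary assumptions rather than the configuration used in this study.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of values

# Toy example: 24 hourly time steps, 3 input features, projection dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 3))                  # e.g., temperature, humidity, pressure
Wq, Wk, Wv = (rng.normal(size=(3, 8)) for _ in range(3))
attended = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)  # shape (24, 8)
```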
Before delving into the input and output attention mechanisms, there is a structure called positional encoding, which provides information about the order and position of input or output vectors. In this study, as the temperature is a time series, knowing the sequence of humidity and pressure data relative to temperature over time can produce effective results in temperature prediction [
26].
In Equations (12) and (13), the pos term represents the position, and the
i term represents the dimension. The reason for choosing sin and cos functions is to facilitate the learning process [
27].
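The sinusoidal positional encoding of Equations (12) and (13) can be sketched as follows; the sequence length and model dimension below are placeholder values chosen only for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]       # dimension index
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=24, d_model=8)  # one encoding vector per hourly step
```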
The O(n²) expression is written in Big-O notation, a standard way in computer science to describe the computational or memory complexity of an algorithm in relation to the input size. Here, n denotes the length of the input sequence. A complexity of O(n²) indicates that the computational time or memory usage grows quadratically with the input length. In the context of attention mechanisms, this reflects the fact that each element in the sequence must attend to every other element, resulting in n × n comparisons [28].
Another structure in the Transformer architecture is the feed-forward network. One of the main features of the feed-forward structure is to learn the input–output mappings of a sequence that characterizes a real system. In this study, a multi-layer perceptron (MLP) structure is used as the feed-forward network. The preference for an MLP stems from the nature of the data as a time series [18]. The MLP in a Transformer consists of two fully connected layers with a nonlinear activation function in between. As indicated in the LSTM section, activation functions provide differentiable nonlinearity, leading to fewer errors in the estimation phase [4]. The MLP is a crucial component of the architecture that is applied independently to each position of the sequence. The MLP is utilized after the attention mechanism and is responsible for transforming and processing the data before they are passed on to the next layer. The use of the MLP in Transformers helps in capturing complex features and patterns in the data that are not easily discernible by the attention mechanism alone. It provides the model with the capacity to model nonlinear dependencies, which is significant for understanding and generating complex sequences such as language, images, or time series data.
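A position-wise feed-forward (MLP) block of the kind described here can be sketched as two dense layers with a nonlinearity in between, applied independently at each time step. The dimensions below and the choice of ReLU as the inner activation are illustrative assumptions, not the 256-unit MLP configuration reported in the Results section.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise MLP: two dense layers with a ReLU nonlinearity in between,
    applied independently to every position (time step) of the sequence."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # nonlinear activation between the two layers
    return hidden @ W2 + b2

# Toy example: 24 time steps, model dimension 8, inner MLP dimension 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
out = feed_forward(X, W1, b1, W2, b2)        # shape (24, 8), same as the input
```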
As seen in Figure 4, the only difference from the Transformer architecture in Figure 3 is the use of the MLP in the feed-forward block.
2.4. Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX)
One of the prominent forecasting methods for time series data that exhibit seasonality is the SARIMA (Seasonal Autoregressive Integrated Moving Average) model. The SARIMA model relies on the series being, or being made, stationary, and this characteristic enables it to produce stable and reliable forecasts for future time periods [29]. Seasonal time series exhibit fluctuations that vary over time; temperature data, in particular, often reveal periodic patterns based on daily, monthly, or yearly cycles. Since the temperature values in the dataset display strong seasonality and periodicity, the SARIMA model is employed to eliminate the adverse effects of these seasonal components across the entire dataset during the forecasting process. As a result, the SARIMA model can generate robust insights when dealing with seasonal and periodic time series data.
The SARIMAX model is derived by extending the SARIMA model with the inclusion of additional explanatory variables. SARIMAX is an advanced signal processing and statistical modeling technique that possesses both linear and polynomial characteristics. Its linearity is inherited from the ARIMA component, while its seasonality and polynomial behavior are captured by the SARIMA structure. Equation (14) below presents the formulation of the SARIMAX model, which effectively integrates both the ARIMA and SARIMA frameworks [
3]. The parameters of Equation (14) are represented in
Table 1.
As illustrated in Equation (14), seasonal and periodic parameters have been integrated into the model to ensure that the effects of seasonality and periodicity are appropriately captured across the entire dataset.
Figure 5 illustrates that the temperature values have seasonality over ten years. Specifically, while the ARIMA component in Equation (14) is able to model short-term trends and fluctuations on an hourly or daily basis, the SARIMAX model focuses on capturing the broader seasonal patterns and long-term trends present in the temperature data. By incorporating additional explanatory features, the SARIMAX model, which is composed of the SARIMA and ARIMA models, is capable of learning both short-term and long-term dynamics, offering a more comprehensive understanding of the underlying temporal behavior [
4].
Equation (15) shows the compact form of Equation (14) and indicates that SARIMAX models consist of seasonal and non-seasonal components and parameters. The non-seasonal part of Equation (15) corresponds to ARIMA; adding the seasonal terms turns it into SARIMA, which is able to offer seasonal insights.
ARIMA is a statistical model used to predict future values; it relies on autoregressive terms to capture the predictions. While exponential smoothing approaches are built around the seasonality captured in the data, the ARIMA model describes a linear autoregressive moving average structure for statistical prediction. However, the ARIMA model faces significant challenges in long-term estimation because it has no seasonal order selection over time, and so it struggles to make strong estimations for data that include seasonality and trends. Since the temperature values studied in this work include seasonality and trends, the ARIMA model alone is not sufficient; it must be extended to the SARIMAX model to add the seasonality and trend perspectives [4].
Although the SARIMAX model does not involve a gradient-based optimization process like LSTM or Transformer, its fitting procedure can still be computationally intensive. This is because SARIMAX parameter estimation relies on iterative maximum likelihood optimization, which requires the repeated evaluation of the likelihood function for different combinations of autoregressive, differencing, moving average, and seasonal parameters [
30].
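A minimal statsmodels sketch of the SARIMAX fitting procedure described above is given below. The non-seasonal and seasonal orders match the values reported in the Results section, but the variable names and the exogenous-feature layout (humidity and pressure as regressors, with `train` and `test` taken from the preprocessing sketch) are assumptions for illustration.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# 'train' and 'test' are assumed to be the chronologically ordered DataFrames from the
# preprocessing sketch; SARIMAX is fit on the unscaled values.
endog = train["Temperature (C)"]
exog = train[["Humidity", "Pressure (millibars)"]]

# Non-seasonal order (p, d, q) = (6, 1, 1) and seasonal order (P, D, Q, s) = (1, 1, 1, 24),
# i.e., a 24 h daily cycle, as reported in the Results section.
model = SARIMAX(
    endog,
    exog=exog,
    order=(6, 1, 1),
    seasonal_order=(1, 1, 1, 24),
    enforce_stationarity=True,
    enforce_invertibility=True,
)
result = model.fit(disp=False)  # iterative maximum likelihood estimation

# Multi-step forecast over the test horizon, supplying the known exogenous features.
forecast = result.forecast(
    steps=len(test), exog=test[["Humidity", "Pressure (millibars)"]]
)
```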
2.5. Metrics
In machine learning regression tasks, the preferred performance metrics are the mean absolute error (MAE), the mean squared error (MSE), and the coefficient of determination (R2) score. Their popularity comes from their ability to effectively quantify prediction errors, measure model accuracy, and compare different forecasting approaches. Since LSTM and SARIMAX capture short-term dependencies well, MAE helps evaluate how closely the models follow actual trends. LSTMs benefit from MSE because it helps in learning stable patterns over sequences, preventing the model from making drastic prediction errors. For Transformers, which focus on long-range dependencies, MAE highlights how much overall deviation exists across time steps, and MSE is leveraged to refine long-range forecasting, ensuring that high-variance sequences remain stable. During the training phase, MSE is used to calculate error values; during the testing phase, MAE is employed to indicate the error between the actual and predicted temperatures. The MAE, MSE, and R2 scores are not used during the training of SARIMAX.
In this study, the selection of optimal hyperparameters for the LSTM, Transformer, and SARIMAX models is carried out through an iterative search process based on the performance metrics (MAE, MSE, and R2) on the validation set. While this study reports only the final selected hyperparameters, we acknowledge that presenting the performance variation across a range of values for key parameters (e.g., learning rate, hidden layer size, and number of epochs for LSTM/Transformer; p, d, q, P, D, Q, and s for SARIMAX) would provide deeper insights into model robustness and the extent of the tuning effort required.
The
R2 score, also known as the coefficient of determination, indicates how well the predicted values of a model capture the variance in the actual data. An
R2 value of 1 means the model perfectly explains all the variability in the target variable, representing ideal predictive performance. An
R2 of 0 implies that the model fails to explain any variance and performs no better than simply predicting the mean of the observed data [
31]. If the
R2 score is negative, it suggests that the model performs worse than a naive mean predictor. In general, a desirable
R2 score ranges between 0 and 1, with values closer to 1 indicating stronger explanatory power and more reliable predictions [
15].
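For reference, the sketch below computes the three metrics with scikit-learn on a small example; the arrays are placeholders standing in for the actual and predicted temperature series.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder arrays standing in for observed and predicted temperatures (in °C).
y_true = np.array([10.2, 11.5, 13.1, 14.0, 12.8])
y_pred = np.array([10.0, 11.9, 12.7, 14.4, 12.5])

mae = mean_absolute_error(y_true, y_pred)   # mean |y_i - y_hat_i|
mse = mean_squared_error(y_true, y_pred)    # mean (y_i - y_hat_i)^2
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE = {mae:.3f} °C, MSE = {mse:.3f}, R2 = {r2:.3f}")
```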
3. Results
The temperature predictions are obtained using the Transformer, LSTM, and SARIMAX models.
Table 2 presents the best results achieved in the short term by these three models, while
Table 3 shows the best results in the long term by the three models. The LSTM model uses a learning rate of 0.0001, 2 layers, a hidden size of 128, a batch size of 128, the Adam optimizer, and a look-back window of 24 samples; it takes about 20 min to train with these parameters. The Transformer model achieved its best results with the following parameters: 16 attention layers, 4 Transformer layers, an MLP size of 256, a learning rate of 0.00001, and a batch size of 64; it takes about 45 min to train with these parameters. The SARIMAX model achieved its best results with the following parameters: an autoregressive order of 6, a degree of differencing of 1, a moving average order of 1, a seasonal autoregressive order of 1, a seasonal differencing of 1, a seasonal moving average order of 1, and a seasonal period of 24, with the enforce-stationarity and enforce-invertibility options enabled; it takes about 1 h to train with these parameters and conditions. The results are shown in
Table 2 according to the
MAE,
MSE, and
R2 score metrics. Based on
Table 2 and
Table 3, LSTM makes the best estimation compared to SARIMAX and Transformer for the short term, but the estimations of LSTM and SARIMAX are close to each other for the short term. Although SARIMAX performs well in short-term forecasts by capturing recent autoregressive, moving average, and seasonal patterns, its errors grow quickly in long-term predictions. In multi-step forecasting, each new prediction depends on the previous one, so small errors accumulate and lead to much larger deviations, which significantly increase the
MSE. Furthermore, because SARIMAX is a linear model with fixed parameters, it cannot effectively capture nonlinear patterns, long-range dependencies, or structural changes in the data. As a result, shifts in seasonality, trends, or external factors further reduce its accuracy over longer horizons.
MAE also rises sharply, reflecting the model’s poor ability to keep predictions close to observed values over time. In contrast, the Transformer achieves both low
MSE and
MAE, indicating stable and accurate forecasts. The
R2 score further highlights that SARIMAX produces negative values, meaning the model fits worse than a simple mean predictor. The Transformer maintains a clearly positive
R2, demonstrating that it captures meaningful structure in the data.
LSTM has the lowest
MAE and
MSE and is the only model with a positive
R2 score in the short term. This means that LSTM performs by far the best for short-term forecasting. SARIMAX performs very poorly, especially over the long term. Its
MAE and
MSE are very high, and its
R2 is very negative compared to those of LSTM and the Transformer. The Transformer makes the best estimation compared to SARIMAX and LSTM in the long term because it has the lowest MAE and MSE and the highest R2 score.
In Equations (16)–(18), N represents the number of samples, y_i the actual value, and ŷ_i the predicted temperature value. Additionally, in this study, temperature predictions for the upcoming week and for six months ahead are conducted, and the errors for the following week and the next six months are calculated in terms of the MAE, MSE, and R2 scores.
The one-week and six-month forecasts on the obtained test data were also compared for the three models.
As shown in Figure 6, based on the graphs for the one-week temperature forecast, all three models capture the pattern of the actual temperature values. However, as shown in Figure 7, based on the graphs for the six-month temperature forecast, it is mostly the Transformer that captures the pattern of the actual temperature values; the SARIMAX and LSTM models do not capture the pattern in the long term. Although the LSTM model, like SARIMAX, does not capture the long-term pattern, it achieves temperature prediction with lower error than the SARIMAX model. LSTM networks are specifically designed for sequential data processing, utilizing memory cells to capture short-term and long-term dependencies. Temperature patterns exhibit recurring trends, which LSTM can efficiently learn over time. LSTM tends to generalize better than Transformer models, which typically require large amounts of data to effectively learn complex temporal relationships. In contrast, SARIMAX models offer a statistical approach that can handle both trend and seasonality components explicitly, and they perform well on time series with strong autocorrelation and when exogenous variables are available. Transformers, meanwhile, rely on self-attention mechanisms, which may struggle to capture local dependencies as effectively as LSTM on smaller datasets. Similarly, SARIMAX models, while effective for capturing linear trends and seasonal patterns, may fall short in modeling the complex nonlinear dependencies inherent in temperature time series.
Figure 7,
Figure 8,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 show a two-dimensional histogram of errors made by the LSTM, Transformer, and SARIMAX models.
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 depict the distribution of errors obtained for one-week and six-month temperature predictions divided into hours in a two-dimensional histogram. The y-axis shows the absolute error value, and the x-axis expresses the 24 h period of a day. For example, the sum of errors for one week in the LSTM model, as shown in
Figure 8, is about 3.6 °C in terms of absolute error at 10:00 AM, whereas the sum of errors for six months in the Transformer model, as shown in Figure 11, is about 2.98 °C in terms of absolute error at 10:00 AM. Similarly, the sum of errors for one week in SARIMAX, as shown in Figure 12, is 2.97 °C in terms of absolute error at 10:00 AM.
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13 show at which times of the day the algorithms make more errors so that a solution can be established in order to decrease the errors. Based on the results shown in
Figure 8,
Figure 9,
Figure 10,
Figure 11,
Figure 12 and
Figure 13, it can be clearly seen that errors in temperature prediction are more pronounced during time periods close to sunrise, shortly after sunrise, occasionally at noon, and around sunset. This also indicates that models must be able to sense rapid changes in temperature during sunrise and sunset. The LSTM model makes its biggest errors during sunrise in the short term, and during sunset and at noon in the long term. The Transformer model makes its biggest errors at night and during sunrise in both the short term and the long term. The SARIMAX model makes its biggest errors at noon and during sunset in the short term, and mostly during sunrise in the long term.
Previous studies and this work show that smaller datasets and fewer features lead to better results for SARIMAX and LSTM in the short term, whereas larger datasets and more features provide better outcomes for the Transformer in the long term, which is compatible with greater model complexity and larger data sizes. Accordingly, the Transformer model tends to produce smoother predictions, as shown in
Figure 6 and
Figure 10, failing to capture the sharp fluctuations present in the actual temperature at night in the short term. This result is likely due to the Transformer’s reliance on self-attention mechanisms, which tend to average out rapid changes and require larger datasets to effectively learn fine-grained temporal variations. When it comes to training time, the LSTM and Transformer models demonstrate greater efficiency compared to the SARIMAX model, based on the performance outcomes achieved. This suggests that deep learning-based approaches not only offer competitive predictive capabilities but also require less computational time during model training, making them more suitable for time-sensitive applications. The SARIMAX and LSTM models seem to produce better predictions, as seen in
Figure 6,
Figure 8, and
Figure 12, for the short term. This result is likely because SARIMAX explicitly models seasonality and trend components using statistical formulations, while LSTM captures short-term dependencies more effectively through its memory cell structure, making both models well-suited for learning rapid fluctuations in temperature data. On the other hand, the Transformer model shows the best prediction in
Figure 7 and
Figure 11 for the long term. This result could be due to the model’s attention mechanism, which focuses on long-range dependencies but may overlook subtle, short-term variations that are crucial for accurate temperature forecasting. The SARIMAX and LSTM models do not make great predictions in
Figure 7,
Figure 9 and
Figure 13 for the long term. These results show that SARIMAX is limited in capturing complex nonlinear dynamics beyond seasonal and trend components, and LSTM tends to lose effectiveness over longer time horizons due to its vanishing gradients and limited memory capacity.
4. Discussion
The obtained results indicate that the LSTM and SARIMAX models successfully capture the pattern of real temperatures in the short term, whereas the Transformer makes better predictions in the long term. Particularly in the short term, the LSTM model predicts the real temperature values with fewer errors compared to the Transformer and SARIMAX models, indicating its more effective and robust prediction capability. Similarly, the SARIMAX model also demonstrates a strong short-term performance by explicitly modeling seasonality and trend components. This overall success relies on LSTM’s ability to learn from complex nonlinear patterns and SARIMAX’s strength in capturing structured temporal dependencies in time series data. The Transformer model shows a superior performance in long-term predictions, thanks to its self-attention mechanism, which effectively captures global temporal dependencies and scales well in the long term.
Although this study primarily concentrates on comparing the forecasting performance of the examined models, it is equally important to explore the underlying mechanisms through which these models achieve their results—particularly to understand the factors contributing to the Transformer’s superior long-term forecasting capability. Recent research has introduced interpretability approaches, including attention weight visualization, saliency maps, and SHAP (SHapley Additive exPlanations), which offer means to identify the temporal patterns, lag dependencies, and frequency components that are most influential in the prediction process. Incorporating such techniques in future analyses could provide a more transparent understanding of the learned representations, thereby strengthening the connection between model behavior and the underlying time series signal characteristics.
Although the individual training durations of the models were relatively short, the overall experimentation process lasted approximately three weeks. This is primarily due to hardware limitations, as the machine used for training does not have sufficient computational power to efficiently explore different parameter configurations. As a result, obtaining workable and stable model outputs required significantly more time, despite the simplicity of each training iteration.
In future studies, more in-depth analyses will be conducted on the LSTM, SARIMAX, and Transformer models to further enhance their performance. Specifically, efforts will be made to improve the attention mechanism of the Transformer model and reduce the error rates of the model in terms of absolute error. The histograms in the Results section show the biggest error in a day, so attention mechanisms can be adapted accordingly. Additionally, the enhancement of data preprocessing techniques and better adjustment of model parameters (such as the number of layers, hidden layer size, attention mechanism, embedding size, autoregression, moving average, seasonality factors, etc.) for the three models are targeted to achieve more robust and accurate results. This enhancement can be translated to the hybrid model. By combining the strengths of the Transformer, LSTM, and SARIMAX models, it can be used to achieve better temperature prediction.
This study demonstrates the significant success in temperature prediction achieved by the LSTM, SARIMAX, and Transformer models, indicating that these models can provide more effective solutions to time series prediction problems in the future with further development. The success of these models highlights the great potential of time series data analysis and prediction. A hybrid model that combines LSTM, SARIMAX, and the Transformer is also worth evaluating, as these models have complementary strengths for temperature prediction in terms of attention mechanisms, memory, and seasonality.
In addition to the LSTM, SARIMAX, and Transformer models used in this study, future work will explore the implementation of the Facebook Prophet library. Prophet is specifically designed for time series forecasting and is particularly well-suited for data exhibiting strong seasonal trends and holiday effects, making it a promising tool for temporal data analysis. Its intuitive modeling approach and automatic handling of seasonality components enable analysts to generate reliable forecasts even with limited domain-specific tuning. The error metrics obtained from the LSTM, SARIMAX, Transformer, and hybrid approaches and the Facebook Prophet model will be systematically compared using the same dataset to comprehensively evaluate performance. This comparison will help determine the relative strengths and weaknesses of each model, particularly in capturing long-term dependencies, handling nonlinear patterns, and adjusting to seasonal fluctuations. By incorporating Prophet into the experimental framework, this future work will aim to enhance the robustness of these findings and provide broader insights into time series forecasting techniques.
One of the primary limitations of this study is that the dataset used for temperature prediction was collected exclusively from Hungary, covering the period between 2006 and 2016. While the dataset provides a comprehensive and structured representation of temperature variations within the region, its geographic specificity may limit the generalizability of the findings to other climates and geographical conditions [
12]. Temperature prediction models, including LSTM, SARIMAX, and Transformer models, rely on learning temporal dependencies within the data. However, climate patterns, seasonal variations, and atmospheric conditions differ significantly across regions [
32]. For example, temperature fluctuations in Hungary may not accurately represent those in equatorial regions with stable weather patterns or in polar regions with extreme seasonal variations. This could impact the models' ability to generalize to diverse environments.