Article

Retail Demand Forecasting: A Comparative Analysis of Deep Neural Networks and the Proposal of LSTMixer, a Linear Model Extension

by Georgios Theodoridis * and Athanasios Tsadiras
School of Economics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Information 2025, 16(7), 596; https://doi.org/10.3390/info16070596
Submission received: 31 May 2025 / Revised: 4 July 2025 / Accepted: 9 July 2025 / Published: 11 July 2025
(This article belongs to the Special Issue Artificial Intelligence (AI) for Economics and Business Management)

Abstract

Accurate retail demand forecasting is integral to the operational efficiency of any retail business. As demand is described over time, the prediction of demand is a time-series forecasting problem which may be addressed in a univariate manner, via statistical methods and simplistic machine learning approaches, or in a multivariate fashion using generic deep learning forecasters that are well-established in other fields. This study analyzes, optimizes, trains and tests such forecasters, namely the Temporal Fusion Transformer and the Temporal Convolutional Network, alongside the recently proposed Time-Series Mixer, to accurately forecast retail demand given a dataset of historical sales in 45 stores with their accompanying features. Moreover, the present work proposes a novel extension of the Time-Series Mixer architecture, the LSTMixer, which utilizes an additional Long Short-Term Memory block to achieve better forecasts. The results indicate that the proposed LSTMixer model is the best predictor, whilst all the other aforementioned models outperform the common statistical and machine learning methods. An ablation test is also performed to ensure that the extension within the LSTMixer design is responsible for the improved results. The findings promote the use of deep learning models for retail demand forecasting problems and establish LSTMixer as a viable and efficient option.

1. Introduction

The success of any business within the retail sector heavily depends on its ability to constantly increase the efficiency of its operation [1]. To do so, it is fundamental to perform accurate demand forecasts, which optimize not only internal operations but also the business's agility within the supply chain [2], granting the strategic ability to quickly sense and respond to changes, an ability highly correlated with business success [3,4]. Recent studies [5,6] show that accurate retail forecasting models and their application directly benefit consumer satisfaction levels, improve stock optimization and reduce the cost of operation [7]. Moreover, inaccurate forecasts not only increase costs and result in overstocking or understocking but may also compromise businesses when utilized for risk estimation [8].
In practice, retail demand involves multiple time-dependent points for analysis and prediction as well as a plethora of supporting information that explains and affects the trends of future demand. Most retail businesses manage multiple items, item categories and stores and must forecast each individual component, potentially at multiple aggregation levels (for example, the sales of all items per store versus the sales of all items regardless of the store). Therefore, retail demand is expressed as a collection of time series, and retail demand forecasting is translated into a time-series forecasting problem which may be addressed in a univariate or multivariate fashion. A univariate solution handles each component separately; future demand is predicted given the past values of each component alone. Multivariate solutions include all components, and the predicted demand of each is affected by the past values of every component. They may also include additional covariate values that are used to forecast each component and belong to one of the following types:
  • Past Covariates: Values that are historically known, similar to component values.
  • Future Covariates: Values that are historically known whose future values are also known in advance.
  • Static Covariates: Values that remain static over time per component and usually categorize or discretely describe each component.
Bibliographically, retail demand forecasting studies predominantly employ univariate statistical techniques such as ARIMA [9,10,11] or traditional machine learning techniques, namely tree-boosting regressors [12,13]. More recent works do employ neural networks, but the vast majority of studies use simple feed-forward architectures, LSTMs or some hybrid combination of the two [14,15]. In contrast, generic time-series forecasting studies focus on multivariate Deep Neural Network (DNN) architectures [16,17,18]. The Temporal Fusion Transformer (TFT) [19] and Temporal Convolutional Networks (TCNs) [20] are popular choices that have been applied in a plethora of time-series forecasting problems such as financial time series [21,22], energy demand forecasts [23,24], weather forecasts [25,26] and anomaly detection [27,28]. Consequently, a bibliographic gap is observed as retail demand forecasting studies seem to fall significantly behind, not taking advantage of multivariate DNN forecasters, including well-established ones like the aforementioned TFT and TCN.
The perceived bibliographic gap extends as recent studies [29,30] suggest that Transformer-based models, such as TFT, can be outperformed by simpler-in-architecture linear models if designed properly. A recently proposed linear model with DNN accuracy levels is the TSMixer [31], which utilizes simple linear layers on both the time and feature dimensions of the input as well as on all possible covariates. Both the bibliographic gap and the TSMixer architecture motivate the current study, including the proposal of an extension to the TSMixer design, the LSTMixer, by utilizing a Long Short-Term Memory block to escape the purely linear structure whilst remaining low in complexity and, ultimately, increasing model accuracy.
In summary, the motivations behind the present work are the observed lack of DNN models, popular amongst generic time-series studies, within retail forecasting, as well as recent developments suggesting that linear models may outperform their DNN counterparts. To determine the most appropriate forecasting technique, all the aforementioned architectures were optimized, tested and compared against one another, including a proposed extension which aimed to balance the benefits of modern linear approaches and DNN designs. The contributions of the current study are as follows:
  • The bibliographic enrichment of the retail demand forecasting framework by introducing and comparing the well-established generic DNN models TFT and TCN.
  • The introduction of the recently proposed TSMixer model within the retail demand forecasting framework and its comparison against well-established DNN models.
  • The proposal, comparison and ablation testing of LSTMixer, an extension to the original TSMixer design.
The remainder of the present work is structured as follows: Section 2 presents the analysis of all the models utilized within the study. Section 3 introduces the dataset used, defines the forecasting problem to be solved and details the fine-tuning process of each model before testing. Section 4 provides and analyzes the forecasting and ablation testing results. Finally, Section 5 discusses the overall conclusions of the study.

2. Model Analysis

In the following paragraphs, the neural network models that are employed within this study are presented and analyzed. The TSMixer architecture, accompanied by the proposed extension, LSTMixer, is analyzed in detail, whilst the well-established DNNs are adequately overviewed as their architectures are well documented and were not altered for the present research. The LSTM model will also be briefly detailed as it is used to create the LSTMixer model.

2.1. The Long Short-Term Memory Network

Long Short-Term Memory (LSTM) neural networks are a type of recurrent neural network (RNN) that uses a unique architecture of memory cells in order to represent long-term dependencies in time series [32]. The motivation behind their design and application is their resilience, provided by the aforementioned memory cell state, against the vanishing gradient problem [33], i.e., the loss of information when the time horizon extends far into the past.
As depicted in Figure 1, the input of an LSTM is the previous cell state C, the previous hidden state H and the current value of the time series. For multivariate problems, the current value is a vector instead, containing the value of every time series. The output of the network is the new cell state as well as the new hidden state, which is also used as the forecast of the model during prediction time.
To create a deeper neural network, LSTM blocks may be stacked to create a multilayer LSTM network. This is achieved by using the output of the first LSTM layer as the input of the second, and so on. Dropout layers may be used in between. The cell and hidden states are independent for each layer and only the output is passed on to the next layers. To output multiple values, a simple fully connected (FC) network is used to decode the final hidden state of the final block.
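As an illustration, such a stacked LSTM forecaster with an FC decoder can be sketched in PyTorch as follows; the layer sizes and forecast horizon are illustrative placeholders rather than the tuned values used later in the study.

```python
import torch
import torch.nn as nn

class StackedLSTMForecaster(nn.Module):
    """Minimal sketch: stacked LSTM encoder followed by an FC decoder."""

    def __init__(self, n_features: int, hidden_size: int = 64,
                 num_layers: int = 2, horizon: int = 4, dropout: float = 0.1):
        super().__init__()
        # Stacked LSTM: each layer's output sequence feeds the next layer,
        # with dropout in between; cell/hidden states are kept per layer.
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=num_layers, dropout=dropout,
                            batch_first=True)
        # FC decoder maps the final hidden state to a multi-step forecast.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, horizon * n_features),
        )
        self.horizon = horizon
        self.n_features = n_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback, n_features)
        _, (h_n, _) = self.lstm(x)           # h_n: (num_layers, batch, hidden)
        out = self.decoder(h_n[-1])          # decode the last layer's hidden state
        return out.view(-1, self.horizon, self.n_features)
```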

2.2. The Time-Series Mixer and the Proposed LSTMixer

The Time-Series Mixer (TSMixer) model, as introduced in [31], includes two main categories of linear blocks: the Time Mixer and the Feature Mixer. Each Mixer is designed to capture dependencies along the time and feature axes, respectively, hence better leveraging cross-variate information while maintaining a low computational cost. The overall architecture is depicted in Figure 2.
To understand the inner workings of the TSMixer, the Time Mixing and Feature Mixing sub-architectures need to be explained first. The Time Mixer block operates on the time axis. To do so, given that the input is represented by a matrix whose rows represent the timesteps and whose columns represent the features, the input needs to be transposed to expose the time axis to the following linear layer. Hence, for an input matrix $X \in \mathbb{R}^{L \times C}$ the Time Mixing operation $\mathrm{TM}$ is defined as

$$\mathrm{TM}(X)_{\cdot,i} = \mathrm{Norm}\!\left( X_{\cdot,i} + \mathrm{Drop}\!\left( \mathrm{ReLU}\!\left( X_{\cdot,i}^{\top} W_1 + b_1 \right) \right)^{\top} \right), \quad i = 1, \ldots, C \qquad (1)$$

where $X_{\cdot,i}$ denotes the columns of the input, $W_1 \in \mathbb{R}^{L \times L}$ is the weight and $b_1 \in \mathbb{R}^{L}$ is the bias of the linear layer, $\mathrm{ReLU}$ is the elementwise rectified linear unit function defined as $\mathrm{ReLU}(X) = \max(0, X)$, $\mathrm{Drop}$ is the dropout function, which randomly zeroes some of the elements of the input given a predefined probability, and $\mathrm{Norm}$ represents Layer Normalization as defined in [34]. A residual connection is also present in order to potentially ignore unnecessary mixing.

The Feature Mixer block follows the same logic but on the feature axis and contains two linear layers instead of one. Hence, for an input matrix $X \in \mathbb{R}^{L \times C}$ the Feature Mixing operation $\mathrm{FM}$ is defined as

$$\mathrm{FM}(X)_{j,\cdot} = \mathrm{Norm}\!\left( X_{j,\cdot} + \mathrm{Drop}\!\left( \mathrm{ReLU}\!\left( U_{j,\cdot} W_2 + b_2 \right) \right) \right), \quad j = 1, \ldots, L \qquad (2)$$

$$\text{where } U_{j,\cdot} = \mathrm{Drop}\!\left( \mathrm{ReLU}\!\left( X_{j,\cdot} W_3 + b_3 \right) \right)$$

$X_{j,\cdot}$ denotes the rows of the input, $W_2, W_3 \in \mathbb{R}^{C \times C}$ are the weights and $b_2, b_3 \in \mathbb{R}^{C}$ are the biases of the linear layers, and $\mathrm{ReLU}$, $\mathrm{Drop}$ and $\mathrm{Norm}$ are the same functions defined in (1).
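The two mixing operations translate almost directly into PyTorch. The following is a minimal sketch of Equations (1) and (2), keeping the square weight matrices of the equations; the tuned models later in the study also vary the feature-mixing layer widths.

```python
import torch
import torch.nn as nn

class TimeMix(nn.Module):
    """Time-mixing block of Eq. (1): one linear layer applied along the time axis."""

    def __init__(self, seq_len: int, n_features: int, dropout: float = 0.1):
        super().__init__()
        self.lin = nn.Linear(seq_len, seq_len)   # W1, b1 act on the time axis
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, C); transpose so the linear layer mixes timesteps
        y = self.drop(torch.relu(self.lin(x.transpose(1, 2)))).transpose(1, 2)
        return self.norm(x + y)                   # residual connection, then Norm

class FeatureMix(nn.Module):
    """Feature-mixing block of Eq. (2): two linear layers along the feature axis."""

    def __init__(self, n_features: int, dropout: float = 0.1):
        super().__init__()
        self.lin1 = nn.Linear(n_features, n_features)   # W3, b3
        self.lin2 = nn.Linear(n_features, n_features)   # W2, b2
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.drop(torch.relu(self.lin1(x)))          # inner layer U of Eq. (2)
        y = self.drop(torch.relu(self.lin2(u)))          # outer layer
        return self.norm(x + y)                          # residual connection, then Norm
```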
The initial input of the TSMixer network is separated into three groups: the past, future and static matrices. As previously stated, the rows of the matrices represent the timesteps, but in each group their lengths are different. The past matrix has rows equal to the historically known values, the future matrix equal to the forecasting window and the static matrix equal to 1. This causes an issue as the TSMixer design concatenates these matrices. To address this, the static matrix is repeated and stacked to match the forecasting window while the past matrix is subjected to Temporal Projection. The Temporal Projection layer is a fully connected layer applied on the time domain. It simply performs a single linear transformation on the input to learn initial temporal patterns and to resize it from length L to the forecasting window length T.
After the inputs are properly resized, the past and future inputs are Feature Mixed and then the first round of Repeated Mixing starts: the static input is Feature Mixed and then all inputs are concatenated, Time Mixed and finally Feature Mixed. The output can then be concatenated with the Feature Mixed static input to repeat this mixing process. When completed, the resulting output is inputted to a fully connected network to generate the final output, the forecast of any desired horizon T.
The proposed LSTMixer extension appends an LSTM block to the original architecture, as analyzed within the previous paragraphs, that receives the projected past matrix as input and outputs its final hidden layer directly into the fully connected network, hence skipping all mixing. The idea is to utilize the LSTM block as an encoder to maintain information even after extensive repeats of mixing, as well as to introduce non-linearity to the model.
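A minimal sketch of the proposed extension is given below. How the LSTM hidden state and the mixed representation are combined inside the final fully connected layer (here, simple concatenation) and all sizes are assumptions of this sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

class LSTMixerHead(nn.Module):
    """Sketch of the LSTMixer idea: an LSTM encodes the temporally projected past
    matrix, and its final hidden state bypasses all mixing, entering the final
    fully connected layer together with the mixed representation."""

    def __init__(self, n_features: int, n_mixed: int, hidden_size: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                               batch_first=True)
        # The final FC layer sees both the mixed features and the LSTM encoding.
        self.out = nn.Linear(n_mixed + hidden_size, n_features)

    def forward(self, projected_past: torch.Tensor,
                mixed: torch.Tensor) -> torch.Tensor:
        # projected_past: (batch, T, n_features), past matrix after Temporal Projection
        # mixed:          (batch, T, n_mixed), output of the repeated mixing rounds
        _, (h_n, _) = self.encoder(projected_past)
        h = h_n[-1].unsqueeze(1).expand(-1, mixed.size(1), -1)  # broadcast over T
        return self.out(torch.cat([mixed, h], dim=-1))          # forecast per timestep
```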

2.3. The Temporal Convolutional Network

The Temporal Convolutional Network (TCN) [20] is a DNN that applies multiple dilated, causal 1D convolutions along the time axis of its input and creates an output of equal size. Its architecture can be presented at three different levels of abstraction, as overviewed in Figure 3 and sketched in code after the list:
  • The TCN network is a sequence of multiple Residual Blocks that perform dilated convolutions with kernel size k and a dilation factor that increases exponentially with base b. The sequence length, m, is automatically calculated so that the network achieves full history coverage.
  • Each Residual Block consists of two convolutional layers that are followed by a potential weight normalization, a ReLU activation that enables the model to achieve non-linearity, and a spatial dropout layer.
  • Each convolutional layer applies multiple filters with kernel size k and the layer's dilation factor.
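The core building block, a dilated causal 1D convolution, is sketched below; the padding strategy and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal 1D convolution: left-only padding keeps the output the same
    length as the input and prevents any future timestep from leaking in."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Stacking such layers with dilations 1, b, b^2, ... grows the receptive field
# exponentially until the full lookback window is covered.
```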

2.4. The Temporal Fusion Transformer

The Temporal Fusion Transformer (TFT) [19] incorporates multiple DNN techniques to fully utilize all types of covariate information and extract both short- and long-term correlations. The main structures within the TFT model, visualized in Figure 4, are the following:
  • Variable selection networks as well as a dedicated static variable encoder to properly handle and filter the input and all covariates that accompany it.
  • An LSTM encoder–decoder layer that encapsulates short-term correlations.
  • A multi-head attention layer that is able to extract long-term dependencies and benefits from the scalability that Transformer architectures enjoy.
  • Multiple residual connections and a dedicated Gated Residual Network (GRN) that skips potentially unnecessary functions within the network.
  • A final linear layer that is tasked with outputting quantile predictions for probabilistic forecasts whenever such analysis is required.

3. Materials and Methods

The following paragraphs present the experimental setup by introducing the dataset and the retail demand problem to be addressed, as well as the preparation and optimization of the aforementioned neural network techniques and their initial parameters. Programmatically, the analysis, preparation, modeling and optimization of all the algorithms used are implemented via the Python programming language (version 3.11.8) and the following libraries: Numpy (version 1.26.4) [35], Pandas (version 2.2.1) [36], Scikit-learn (version 1.6.1) [37], XGBoost (version 3.0.0) [38] and PyTorch (version 2.2.0) [39].

3.1. Retail Demand—The Walmart Dataset

The retail demand dataset employed for forecasting, sourced from [40], includes the weekly sales from 45 different Walmart stores for 143 weeks. It also includes additional static and time-variable information in relation to each store which is categorized, as in Table 1, and later introduced to the forecasting models as static, past or future covariates.
After a basic cleaning of the dataset, namely replacing empty MarkDown values with 0 and encoding the store type with numeric values (0, 1 and 2), the first 2 years of datapoints were retained for further exploration, as they represented the training set (further clarified in Section 3.2). Next, a 0–1 normalization per component was performed to generate proper inputs to the neural network models as well as perform a fundamental statistical analysis, as presented in Table 2.
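As a concrete illustration of this preprocessing, the following pandas sketch replaces missing MarkDown values, encodes the store type and applies a per-store min-max normalization fitted on the training period; the file name, the "Week" index column and the remaining column names follow the Kaggle dataset layout and are assumptions rather than the authors' exact code.

```python
import pandas as pd

# Hypothetical merged dataframe of the sales, store and feature files.
df = pd.read_csv("walmart_merged.csv")

# Replace missing MarkDown promotion values with 0.
markdown_cols = [f"MarkDown{i}" for i in range(1, 6)]
df[markdown_cols] = df[markdown_cols].fillna(0)

# Encode the store type A/B/C with the numeric values 0/1/2.
df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})

# 0-1 (min-max) normalization per component (here, per store),
# fitted on the first 104 weeks only (the training set).
train = df[df["Week"] < 104]                       # "Week" is a hypothetical index column
mins = train.groupby("Store")["Weekly_Sales"].min()
maxs = train.groupby("Store")["Weekly_Sales"].max()
df["Sales_norm"] = (df["Weekly_Sales"] - df["Store"].map(mins)) / (
    df["Store"].map(maxs) - df["Store"].map(mins)
)
```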
As presented, the 75th percentile is just 0.33, while the maximum (100th percentile) is 1 after normalization. This indicates the existence of numerous sales values that are extraordinarily higher than usual, thereby revealing potential outliers within the data. To further investigate this behavior, Figure 5 presents the boxplot of the mean sales of each store to detect potential outliers at the store level.
Six stores showed marginally higher numbers of average sales, five of which were all of Type C. Furthermore, there were only 6 Type C stores, with the final one having 0.41 mean sales, barely labeling it as a non-outlier. This suggests that store types do, in fact, directly affect sales and should be analyzed separately for potential outliers. Figure 6 presents the Boxplots of the mean sales per type.
It is now evident that the Type C stores actually contained no outliers but simply higher sales values. The Type B stores included one low-sales and two high-sales outliers, but when aggregating all stores (Figure 5) they were within the normal range. The unique true outlier was observed to be of Type A as it contained sales above the Type C median and was the sixth outlier graphed in the aggregate. In a practical setting, this store would potentially necessitate special treatment to better understand the reason for the abnormally high number of sales, accompanied by a separate method for prediction. In contrast, within the current work, maintaining the outlier store (or any potential outlier stores) as input is desirable as it “stress tests” each model and reveals their robustness against noisy or outlier cases, hence facilitating a healthier and fair comparison.

3.2. Problem Definition and Methodology

Each model previously presented was tasked with solving the following problem: Given a lookback window of 16 weeks (4 months), forecast the following 4 weeks (1 month). The lookback window needs to be long enough to supply all the necessary information but not excessively long, as this may generate noise. It could be treated as a parameter for optimization, with each model using a different lookback window to train and generate predictions, but this would lead to slightly different training sets for each method, which might render the comparison unfair or biased. The forecast window is oftentimes experimentally set as the next future value, but given the retail demand context of the current work, it is important to consider the practical implications of the forecasting methods presented and compared. In retail, it is highly important to forecast multiple timesteps into the future so as to properly manage stocking, delivery and inventory control [41]. Therefore, the forecasting window was set to predict 1 month ahead (4 timesteps) to ensure that each model was not only accurate enough but also employable in real-time scenarios.
Given the original dataset of 143 weeks, the first 104 weeks were chosen as the training set, with the last 39 providing the test set. A sliding window approach was employed to generate the training slices, with a step of 1, while the test slices were generated similarly but starting with the last 16 weeks of the training set as inputs to forecast the first 4 weeks of the test set and then sliding with a step of 4 so each test week was predicted only once.
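The slicing scheme described above can be sketched in a few lines of NumPy; the helper below is illustrative and not the authors' implementation.

```python
import numpy as np

def sliding_windows(series: np.ndarray, lookback: int = 16, horizon: int = 4,
                    step: int = 1):
    """Generate (input, target) slices with a sliding window.
    series: (n_weeks, n_components) array of normalized sales."""
    X, y = [], []
    for start in range(0, len(series) - lookback - horizon + 1, step):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback:start + lookback + horizon])
    return np.stack(X), np.stack(y)

# Training slices: step of 1 over the 104-week training set.
# X_train, y_train = sliding_windows(full_series[:104], step=1)
# Test slices: start with the last 16 training weeks as input and slide by 4,
# so every test week is forecast exactly once.
# X_test, y_test = sliding_windows(full_series[104 - 16:], step=4)
```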
Before training on the aforementioned training set, the structural parameters of each model needed to be determined. Therefore, a hyperparameter optimization phase was set up to fine-tune each method and determine the optimal design for the current retail demand problem. To do so, a validation set needed to be specified to assess the accuracy of every hyperparameter combination. The training set was further split, during fine-tuning, into a sub-training set of 78 weeks so as to create a validation set of 26 weeks (the final 6 months of the training set). Once the optimal models were determined, they were retrained on the entire training set before testing. The fine-tuning process is further detailed in Section 3.3.
Finally, each model forecasted the test set, and they were compared against one another based on multiple accuracy metrics, namely the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE) and the symmetric Mean Absolute Percentage Error (sMAPE). The overall workflow, as described above, is presented in Figure 7.
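For clarity, the three error metrics can be computed as follows; the exact sMAPE variant (factor of 2 in the denominator, reported in percent) is an assumption of this sketch.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Symmetric MAPE, reported here in percent.
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / denom))
```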

3.3. Hyperparameter Optimization

Each model needed to be exhaustively trained and validated given a set or range of possible hyperparameter values. This fine-tuning process is excessively time-consuming and computationally expensive if not properly managed and limited. Both the Transformer- and the Convolution-based architectures experience significantly longer training times than the LSTM, TSMixer and LSTMixer models. Therefore, to properly optimize each model and generate acceptable results within reasonable timeframes, a maximum number of fine-tuning iterations needed to be set, accompanied by a hyperparameter sampler that could intelligently pick trial parameters that minimize loss. To accomplish this, the Optuna [42] optimization library was employed, setting the maximum number of iterations to 200, using the TPE Sampler (Tree-structured Parzen Estimator [43]) and Hyperband pruning [44] to halt and discard unpromising trials. The same options were set for every model, regardless of training time, to maintain the fairness of comparison. Finally, every neural network used the Adam optimizer [45] to minimize the Mean Square Error (MSE) loss. The number of epochs was statically set to 500, but a 50-round early stopping function stopped the training when no improvements were observed.
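A minimal Optuna setup matching this description might look as follows. Here, `train_and_validate` is a hypothetical placeholder for fitting a model on the 78-week sub-training set and returning the validation MSE (reporting per-epoch losses to the trial so Hyperband can prune), and the search space shown is the LSTM one from Table 3.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space mirroring Table 3 (LSTM model).
    params = {
        "hidden_size": trial.suggest_categorical("hidden_size", [16, 32, 64, 128]),
        "num_layers": trial.suggest_categorical("num_layers", [1, 2, 4, 8]),
        "dropout": trial.suggest_categorical("dropout", [0.0, 0.1, 0.2, 0.3]),
        "lr": trial.suggest_categorical("lr", [1e-1, 1e-2, 1e-3, 1e-4]),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32]),
    }
    # Placeholder: trains on the sub-training set, validates on the 26-week
    # validation set, and calls trial.report / trial.should_prune per epoch.
    return train_and_validate(params, trial)

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.HyperbandPruner(),
)
study.optimize(objective, n_trials=200)
```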
Additionally, a tree-boosting regressor method, XGBoost (eXtreme Gradient Boosting) [46], was also optimized to later provide an advanced benchmark comparison. It used the DART booster [47], and the number of estimators was set to 500, with 50-round early stopping so as to mirror the neural network setup.
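A hedged sketch of the corresponding XGBoost configuration is shown below; the learning rate is illustrative, as the actual hyperparameter values were selected by the Optuna search of Table 3.

```python
import xgboost as xgb

# Illustrative configuration: DART booster, 500 estimators, 50-round early stopping.
model = xgb.XGBRegressor(
    booster="dart",
    n_estimators=500,
    early_stopping_rounds=50,
    learning_rate=1e-2,   # illustrative; tuned via Optuna in the study
)
# Early stopping requires a validation set during fitting, e.g.:
# model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```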
Table 3 presents every model with their associated hyperparameters and the parameter sets used during optimization, denoting the optimal results in bold.

4. Results

4.1. Forecasting Results

Using the optimal parameters, each model was trained on the entirety of the training set and then tasked with forecasting the test set. In addition, instead of benchmarking against naïve forecasters or a moving average, two popular statistical models were also applied to the dataset to provide a realistic benchmark. The Exponential Smoothing (ETS) algorithm and the Seasonal ARIMA model (SARIMA, using the Auto-ARIMA algorithm [48]) forecast the test set store-by-store, as they can only be applied to univariate problems.
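The two statistical baselines could be fitted per store roughly as follows; this sketch uses statsmodels' ExponentialSmoothing and pmdarima's auto_arima as stand-ins, and the trend/seasonality settings are assumptions rather than the configurations used in the study.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from pmdarima import auto_arima

def ets_forecast(history: np.ndarray, horizon: int = 4) -> np.ndarray:
    """Exponential Smoothing forecast for one store's sales history."""
    fit = ExponentialSmoothing(history, trend="add", seasonal=None).fit()
    return fit.forecast(horizon)

def sarima_forecast(history: np.ndarray, horizon: int = 4) -> np.ndarray:
    """Seasonal ARIMA via Auto-ARIMA; yearly (52-week) seasonality is assumed."""
    model = auto_arima(history, seasonal=True, m=52, suppress_warnings=True)
    return model.predict(n_periods=horizon)

# Both models are univariate, so each of the 45 stores is fitted and
# forecast separately over the sliding test windows.
```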
The results are depicted in Table 4, where the models are sorted from best to worst based on the median and mean metric errors over every store. The proposed LSTMixer achieves the lowest errors of all models, whilst the TFT and TCN DNNs rank second and third, respectively. In contrast, the TSMixer falls behind by a noticeable margin, especially compared with the proposed LSTMixer extension of its architecture. Interestingly, the pure LSTM model fails to produce an adequate result, as it is outperformed by the XGBoost regressor, yet it is an integral component of both of the best-performing models. This indicates the practicality of the LSTM architecture as an encoder for capturing short-term correlations. An important comparison is that of the well-established neural networks with the baseline statistical models, which are still widely used in practice. The ability to handle multivariate inputs and generate multistep outputs clearly results in significantly more accurate predictions. A potential trade-off is the computational cost and the time constraints associated with DNNs and complex architectures based on Transformers or convolutions. However, the linear TSMixer model is able to produce acceptable results with minimal computational cost, whilst the proposed LSTMixer improves the predictions and maintains a substantially lower computational cost than its well-established DNN counterparts. In conclusion, the LSTMixer model proves to be the most accurate model while also maintaining the efficiency of TSMixer's linear design.
To further verify the resulting conclusions, it is important to ascertain whether the differences in forecasting accuracy amongst the models are statistically significant. To do so, the Harvey, Leybourne and Newbold (HLN) test [49] is performed, using the squared error, on each pair of model forecasts, and the calculated mean and median p-values are presented in Table 5. Notably, any method that previously performed worse than the XGBoost model is excluded from the test, thereby validating the significance of the better predictors, which are the main focus of the current work, whilst using XGBoost as a benchmark option. As the resulting p-values are less than the empirical threshold of 0.05, the differences between all the forecasting accuracies are considered significant, with the sole exception of the TFT compared to the TCN. Therefore, the previous rankings and metric comparisons are indeed representative of each model's forecasting ability and support the conclusion that LSTMixer is the most suitable predictor.
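For reference, the HLN statistic is the Diebold-Mariano statistic with a small-sample correction evaluated against a Student-t distribution; the function below is a textbook-style sketch on squared errors, not the authors' code.

```python
import numpy as np
from scipy import stats

def hln_test(e1: np.ndarray, e2: np.ndarray, h: int = 4) -> float:
    """Harvey-Leybourne-Newbold test on squared forecast errors.

    e1, e2: forecast errors of two models on the same test points;
    h: forecast horizon. Returns the two-sided p-value."""
    d = e1 ** 2 - e2 ** 2                     # loss differential (squared error)
    n = len(d)
    d_bar = d.mean()
    # Long-run variance estimate: autocovariances up to lag h-1.
    gamma = [np.mean((d[: n - k] - d_bar) * (d[k:] - d_bar)) for k in range(h)]
    var_d = (gamma[0] + 2 * sum(gamma[1:])) / n
    dm = d_bar / np.sqrt(var_d)
    # Small-sample correction of Harvey, Leybourne and Newbold (1997).
    correction = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    hln = dm * correction
    return float(2 * stats.t.sf(np.abs(hln), df=n - 1))
```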

4.2. LSTMixer Ablation Study

By reviewing the forecasting results, it is clear that the fine-tuned LSTMixer outperforms the fine-tuned TSMixer. The fine-tuning process did not perform a full grid search of every possible parameter but rather an optimized search with a limited number of trials (as analyzed in Section 3.3). Therefore, the resulting fine-tuned TSMixer differed from the TSMixer sub-model included within the LSTMixer model and was potentially optimized with a different subset of hyperparameters.
To address this, an ablation study was performed on the fine-tuned LSTMixer model. Two models were subsequently created and tested by removing or replacing LSTMixer parts:
  • TSMixer-Pure: By removing the LSTM block altogether, a TSMixer model is created with the same parameters as the LSTMixer. Hence, the original architecture can be directly compared with the proposed one.
  • TSMixer-Pass: The LSTM block is replaced with a linear block that simply transforms the input to match the dimensions of the LSTM's output, the hidden state. This way, the actual benefit of the LSTM block is tested, as the TSMixer architecture might generically benefit from this extra connection that directly passes the unmixed input to the output, regardless of the block used.
The results of the ablation study are presented in Table 6. Both ablated models underperform compared to LSTMixer, with TSMixer-Pass scoring significantly worse, which indicates that the additional connection creates noise unless a proper block is utilized to encode the input. Consequently, the proposed extension of the TSMixer architecture is beneficial to the original design and strictly outperforms both TSMixer-Pure and the fine-tuned TSMixer model, as previously showcased.

5. Discussion

The findings within the present study clearly suggest that retail forecasting greatly benefits from introducing well-established time-series forecasting models that remain under-explored within the retail demand framework. Additionally, while the TSMixer model falls short of the DNN architectures, the proposed extension, LSTMixer, outperforms the competition as it minimizes the forecasting error metrics. The results themselves cannot be directly generalized to every retail demand forecasting problem but rather indicate that future research needs to gradually step away from traditional techniques and focus on the state-of-the-art methods employed in generic multivariate time-series forecasting problems. Given the ever-increasing complexities of retail demand forecasting, statistical approaches, such as ETS and ARIMA, are unable to produce acceptable results. Similarly, more advanced tree-boosting approaches, namely XGBoost, and simpler neural networks, like the traditional LSTM, fail to outperform not only complex DNNs but also simpler linear networks tailored to time-series prediction.
Considering the proposed model, the LSTMixer architecture not only outperforms the original but also provides an efficient medium between complex DNNs with high computational costs and specialized linear approaches. The addition of the LSTM block within the TSMixer architecture allows the non-linear processing of information and is shown to be the sole reason for the increased forecasting accuracy.
Future work will focus on introducing more retail datasets of varying backgrounds to the LSTMixer model, as well as the original TSMixer, to further investigate and generalize the findings. Further modification of the architecture may also be implemented using different recurrent networks, adding residual connections or even extending the contents of the mixer blocks themselves.

Author Contributions

Conceptualization, G.T. and A.T.; methodology, G.T.; software, G.T.; validation, G.T. and A.T.; formal analysis, G.T.; investigation, G.T.; data curation, G.T. and A.T.; writing—original draft preparation, G.T.; writing—review and editing, G.T. and A.T.; visualization, G.T.; supervision, A.T.; project administration, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://www.kaggle.com/datasets/aslanahmedov/walmart-sales-forecast, accessed on 29 May 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Eglite, L.; Birzniece, I. Retail sales forecasting using deep learning: Systematic literature review. Complex Syst. Inform. Model. Q. 2022, 30, 53–62. [Google Scholar] [CrossRef]
  2. Bai, B. Acquiring supply chain agility through information technology capability: The role of demand forecasting in retail industry. Kybernetes 2023, 52, 4712–4730. [Google Scholar] [CrossRef]
  3. Hwang, T.; Kim, S.T. Balancing in-house and outsourced logistics services: Effects on supply chain agility and firm performance. Serv. Bus. 2019, 13, 531–556. [Google Scholar] [CrossRef]
  4. Al Humdan, E.; Shi, Y.; Behnia, M.; Najmaei, A. Supply chain agility: A systematic review of definitions, enablers and performance implications. Int. J. Phys. Distrib. Logist. Manag. 2020, 50, 287–312. [Google Scholar] [CrossRef]
  5. Fildes, R.; Ma, S.; Kolassa, S. Retail forecasting: Research and practice. Int. J. Forecast. 2022, 38, 1283–1318. [Google Scholar] [CrossRef]
  6. Makridakis, S.; Hyndman, R.J.; Petropoulos, F. Forecasting in social settings: The state of the art. Int. J. Forecast. 2020, 36, 15–28. [Google Scholar] [CrossRef]
  7. da Veiga, C.P.; da Veiga, C.R.P.; Puchalski, W.; dos Santos Coelho, L.; Tortato, U. Demand forecasting based on natural computing approaches applied to the foodstuff retail segment. J. Retail. Consum. Serv. 2016, 31, 174–181. [Google Scholar] [CrossRef]
  8. Ghadge, A.; Bag, S.; Goswami, M.; Tiwari, M.K. Mitigating demand risk of durable goods in online retailing. Int. J. Retail Distrib. Manag. 2020, 49, 165–186. [Google Scholar] [CrossRef]
  9. Gastinger, J.; Nicolas, S.; Stepić, D.; Schmidt, M.; Schülke, A. A study on ensemble learning for time series forecasting and the need for meta-learning. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  10. Abbasimehr, H.; Shabani, M. A new framework for predicting customer behavior in terms of RFM by considering the temporal aspect based on time series techniques. J. Ambient Intell. Humaniz. Comput. 2021, 12, 515–531. [Google Scholar] [CrossRef]
  11. Punia, S.; Shankar, S. Predictive analytics for demand forecasting: A deep learning-based decision support system. Knowl.-Based Syst. 2022, 258, 109956. [Google Scholar] [CrossRef]
  12. Massaoudi, M.; Refaat, S.S.; Chihi, I.; Trabelsi, M.; Oueslati, F.S.; Abu-Rub, H. A novel stacked generalization ensemble-based hybrid LGBM-XGB-MLP model for Short-Term Load Forecasting. Energy 2021, 214, 118874. [Google Scholar] [CrossRef]
  13. Islam, M.T.; Ayon, E.H.; Ghosh, B.P.; Chowdhury, S.; Shahid, R.; Rahman, S.; Bhuiyan, M.S.; Nguyen, T.N. Revolutionizing retail: A hybrid machine learning approach for precision demand forecasting and strategic decision-making in global commerce. J. Comput. Sci. Technol. Stud. 2024, 6, 33–39. [Google Scholar] [CrossRef]
  14. Mediavilla, M.A.; Dietrich, F.; Palm, D. Review and analysis of artificial intelligence methods for demand forecasting in supply chain management. Procedia CIRP 2022, 107, 1126–1131. [Google Scholar] [CrossRef]
  15. Seyedan, M.; Mafakheri, F. Predictive big data analytics for supply chain demand forecasting: Methods, applications, and research opportunities. J. Big Data 2020, 7, 53. [Google Scholar] [CrossRef]
  16. Torres, J.F.; Hadjout, D.; Sebaa, A.; Martínez-Álvarez, F.; Troncoso, A. Deep learning for time series forecasting: A survey. Big Data 2021, 9, 3–21. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, X.; Wang, W. Deep time series forecasting models: A comprehensive survey. Mathematics 2024, 12, 1504. [Google Scholar] [CrossRef]
  18. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125. [Google Scholar] [CrossRef]
  19. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  20. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  21. Dai, W.; An, Y.; Long, W. Price change prediction of ultra high frequency financial data based on temporal convolutional network. Procedia Comput. Sci. 2022, 199, 1177–1183. [Google Scholar] [CrossRef]
  22. Ho, R.; Hung, K. Ceemd-based multivariate financial time series forecasting using a temporal fusion transformer. In Proceedings of the 2024 IEEE 14th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 24–25 May 2024; pp. 209–215. [Google Scholar] [CrossRef]
  23. Lara-Benítez, P.; Carranza-García, M.; Luna-Romera, J.M.; Riquelme, J.C. Temporal convolutional networks applied to energy-related time series forecasting. Appl. Sci. 2020, 10, 2322. [Google Scholar] [CrossRef]
  24. Huy, P.C.; Minh, N.Q.; Tien, N.D.; Anh, T.T.Q. Short-term electricity load forecasting based on temporal fusion transformer model. IEEE Access 2022, 10, 106296–106304. [Google Scholar] [CrossRef]
  25. Hewage, P.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F.; Liu, Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482. [Google Scholar] [CrossRef]
  26. Wu, B.; Wang, L.; Zeng, Y.R. Interpretable wind speed prediction with multivariate time series and temporal fusion transformers. Energy 2022, 252, 123990. [Google Scholar] [CrossRef]
  27. He, Y.; Zhao, J. Temporal convolutional networks for anomaly detection in time series. J. Phys. Conf. Ser. 2019, 1213, 042050. [Google Scholar] [CrossRef]
  28. Ayhan, B.; Vargo, E.P.; Tang, H. On the exploration of temporal fusion transformers for anomaly detection with multivariate aviation time-series data. Aerospace 2024, 11, 646. [Google Scholar] [CrossRef]
  29. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, No. 9. pp. 11121–11128. [Google Scholar] [CrossRef]
  30. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term forecasting with tide: Time-series dense encoder. arXiv 2023, arXiv:2304.08424. [Google Scholar] [CrossRef]
  31. Chen, S.A.; Li, C.L.; Yoder, N.; Arik, S.O.; Pfister, T. Tsmixer: An all-mlp architecture for time series forecasting. arXiv 2023, arXiv:2303.06053. [Google Scholar] [CrossRef]
  32. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292. [Google Scholar] [CrossRef]
  33. DiPietro, R.; Hager, G.D. Deep learning: RNNs and LSTM. In Handbook of Medical Image Computing and Computer Assisted Intervention; Academic Press: New York, NY, USA, 2020; pp. 503–519. [Google Scholar] [CrossRef]
  34. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  35. The Numpy Documentation. Available online: https://numpy.org/ (accessed on 29 May 2025).
  36. The Pandas Documentation. Available online: https://pandas.pydata.org/ (accessed on 29 May 2025).
  37. The Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/ (accessed on 29 May 2025).
  38. The XGBoost Documentation. Available online: https://xgboost.readthedocs.io/en/stable/ (accessed on 29 May 2025).
  39. The PyTorch Documentation. Available online: https://pytorch.org/ (accessed on 29 May 2025).
  40. The Walmart Sales Forecast Dataset. Available online: https://www.kaggle.com/datasets/aslanahmedov/walmart-sales-forecast?select=train.csv (accessed on 29 May 2025).
  41. Trauzettel, V. Optimal stocking of retail outlets: The case of weekly demand pattern. Bus. Logist. Mod. Manag. 2014, 14, 3–11. [Google Scholar]
  42. The Optuna Documentation. Available online: https://optuna.readthedocs.io/en/stable/index.html (accessed on 29 May 2025).
  43. Watanabe, S. Tree-structured parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv 2023, arXiv:2304.11127. [Google Scholar] [CrossRef]
  44. Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 2018, 18, 1–52. [Google Scholar]
  45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  46. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  47. Vinayak, R.K.; Gilad-Bachrach, R. Dart: Dropouts meet multiple additive regression trees. In Proceedings of the Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 489–497. [Google Scholar]
  48. Dhamo, E.; Puka, L. Using the R-package to forecast time series: ARIMA models and Application. In Proceedings of the International Conference Economic & Social Challenges and Problems, Tiranë, Albania, 10 December 2010; pp. 1–14. [Google Scholar]
  49. Harvey, D.; Leybourne, S.; Newbold, P. Testing the equality of prediction mean squared errors. Int. J. Forecast. 1997, 13, 281–291. [Google Scholar] [CrossRef]
Figure 1. The LSTM block, where H is the hidden state, C is the cell state, and X is the input.
Figure 2. The architecture of the Time-Series Mixer, with the optional LSTMixer extension on the left.
Figure 3. The architecture of the Temporal Convolutional Network as presented in [20].
Figure 4. The architecture of the Temporal Fusion Transformer as presented in [19].
Figure 5. Boxplot of mean sales per store.
Figure 6. Boxplot of mean sales per store for each type.
Figure 7. Overview of methodology workflow.
Table 1. The covariates of the Walmart dataset with brief explanations and their types.

Covariate | Explanation | Type
Type | The anonymized type of the store as A, B or C. | Static
Size | The size of the store. | Static
Temperature | The average temperature in the area. | Past
Fuel_Price | The average fuel price in the area. | Past
CPI | The consumer price index. | Past
Unemployment | The unemployment rate. | Past
isHoliday | Declares if the current week is/contains a holiday. | Future
MarkDown 1–5 (5 separate features) | Special ongoing promotions. | Future
Table 2. Statistical metrics for all sales in every store (aggregated).

Metric | Value
Mean | 0.26
Std | 0.18
25% Percentile | 0.14
Median | 0.20
75% Percentile | 0.33
Table 3. Every optimized model, with list of hyperparameters and their fine-tuned values. Optimal findings in bold.

Model | Hyperparameters | Values
XGBoost | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Hessian | 0.5, 1, 2
 | Gamma | 1, 10^−1, 10^−2, 10^−3
 | Max depth | 4, 8, 16, None
LSTM | Hidden map size | 16, 32, 64, 128
 | Number of LSTM stacks | 1, 2, 4, 8
 | Number of hidden layers in FC | 2, 4, 8
 | Number of neurons per layer in FC | 8, 16, 32, 64
 | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Batch size | 4, 8, 16, 32
TSMixer | Number of mixing rounds | 2, 4, 8, 16
 | Size of first layer in FM | 32, 64, 128
 | Size of second layer in FM | 32, 64, 128
 | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Batch size | 4, 8, 16, 32
LSTMixer | Number of mixing rounds | 2, 4, 8, 16
 | Size of first layer in FM | 32, 64, 128
 | Size of second layer in FM | 32, 64, 128
 | Hidden map size | 16, 32, 64, 128
 | Number of LSTM stacks | 1, 2, 4, 8
 | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Batch size | 4, 8, 16, 32
TCN | Kernel size | 2, 4, 8
 | Number of filters | 2, 4, 8
 | Dilation base | 2, 3, 4
 | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Batch size | 4, 8, 16, 32
TFT | Hidden state size | 16, 32, 64, 128
 | Number of LSTM stacks | 1, 2, 4, 8
 | Number of attention heads | 2, 4, 8
 | Dropout rate | 0, 0.1, 0.2, 0.3
 | Learning rate | 10^−1, 10^−2, 10^−3, 10^−4
 | Batch size | 4, 8, 16, 32
Table 4. The forecast error on the test set based on the sMAPE, MAE and RMSE metrics, sorted from best to worst. As there are 45 different forecasts, one per store, for each metric, we present both the median and the mean.

Model | sMAPE | MAE | RMSE
LSTMixer (median) | 18.8 | 0.041 | 0.050
LSTMixer (mean) | 21.7 | 0.046 | 0.059
TFT (median) | 19.8 | 0.042 | 0.052
TFT (mean) | 22.9 | 0.047 | 0.060
TCN (median) | 20.4 | 0.046 | 0.060
TCN (mean) | 23.5 | 0.051 | 0.067
TSMixer (median) | 22.2 | 0.045 | 0.059
TSMixer (mean) | 25.2 | 0.054 | 0.070
XGBoost (median) | 23.1 | 0.043 | 0.053
XGBoost (mean) | 26.2 | 0.054 | 0.068
LSTM (median) | 22.9 | 0.050 | 0.060
LSTM (mean) | 27.5 | 0.056 | 0.071
SARIMA (median) | 24.5 | 0.051 | 0.064
SARIMA (mean) | 28.1 | 0.058 | 0.072
ETS (median) | 28.4 | 0.069 | 0.088
ETS (mean) | 30.3 | 0.069 | 0.091
Table 5. The p-values, median and mean, between the forecasts of each model via the HLN test.

Model | LSTMixer | TFT | TCN | TSMixer
LSTMixer (median) | - | 0.049 | 0.049 | 0.037
LSTMixer (mean) | - | 0.047 | 0.046 | 0.042
TFT (median) | 0.049 | - | 0.073 | 0.035
TFT (mean) | 0.047 | - | 0.070 | 0.044
TCN (median) | 0.049 | 0.073 | - | 0.048
TCN (mean) | 0.046 | 0.070 | - | 0.049
TSMixer (median) | 0.037 | 0.035 | 0.048 | -
TSMixer (mean) | 0.042 | 0.044 | 0.049 | -
XGBoost (median) | 0.039 | 0.049 | 0.049 | 0.050
XGBoost (mean) | 0.042 | 0.046 | 0.046 | 0.046
Table 6. The forecast errors, median and mean, on the test set of the original LSTMixer and its two ablations, including the p-values of the HLN test between LSTMixer and each ablation.

Model | sMAPE | MAE | RMSE | p-Values
LSTMixer (median) | 18.8 | 0.041 | 0.050 | -
LSTMixer (mean) | 21.7 | 0.046 | 0.059 | -
TSMixer-Pure (median) | 22.8 | 0.042 | 0.060 | 0.045
TSMixer-Pure (mean) | 25.9 | 0.054 | 0.068 | 0.043
TSMixer-Pass (median) | 25.8 | 0.051 | 0.059 | 0.045
TSMixer-Pass (mean) | 27.8 | 0.057 | 0.070 | 0.040
