Comparing Prophet and Deep Learning to ARIMA in Forecasting Wholesale Food Prices

Setting sale prices correctly is of great importance for firms, and the study and forecasting of price time series is therefore a relevant topic not only from a data science perspective but also from an economic and applicative one. In this paper we examine different techniques to forecast the sale prices applied by an Italian food wholesaler, as a step towards the automation of pricing tasks usually taken care of by human workforce. We consider ARIMA models and compare them to Prophet, a scalable forecasting tool by Facebook based on a generalized additive model, and to deep learning models exploiting Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs). ARIMA models are frequently used in econometric analyses, providing a good benchmark for the problem under study. Our results indicate that ARIMA models and LSTM neural networks perform similarly for the forecasting task under consideration, while the combination of CNNs and LSTMs attains the best overall accuracy, but requires more time to be tuned. On the contrary, Prophet is quick and easy to use, but considerably less accurate.


Introduction
The main aim of firms is profit maximization. To achieve this goal, the constant updating and forecasting of selling prices is of fundamental importance for every company. Although digital transformation is a phenomenon involving all companies, from small to large, many of them still update prices by hand, following logic that is not always clear, objective, or transparent, but rather based on the experience and expertise of those in charge of updating the price list. On the other hand, the automation of price prediction and updating can provide a strong productivity boost by freeing up human resources, which can thus be allocated to more creative and less repetitive tasks. This also boosts employee morale and commitment, speeds up the achievement of goals, and improves accuracy by minimizing human errors. Subjectivity is also reduced: once the operating criteria have been established, forecast algorithms will keep behaving consistently, which in turn means an improvement in compliance.
Besides the automation of price updates, the prediction of the sales prices charged to customers in the short term also holds great value. In general, organizations across all sectors of industry must undertake business capacity planning to be efficient and competitive. Predicting product prices is tightly connected to demand forecasting and therefore allows for better management of warehouse stocks. The current economic crisis caused by COVID-19 has highlighted the value of such management optimization, stressing the importance of companies' ability to minimize inventory reserves and of just-in-time production models. The forecasting models considered in this paper can contribute to keeping the difference between wholesale purchase prices and the company's sales prices under control, in view of maximizing the gross operating income. They can therefore help companies avoid the risk of incalculable losses and ultimately improve their contractual capacity.
The present work deals with these topics by investigating and comparing different price forecasting models. The specific task we consider is that of predicting the prices of three food products sold by a medium/large-size local wholesaler based in central Italy. In this way, we investigate the predictability of wholesale prices, comparing the performance of traditional econometric time series forecasting models with Facebook's Prophet and machine learning models. The main goal of this paper is therefore to develop a forecasting model that could represent a first step towards the automation of the price-setting process, thus effectively aiding the work of company employees. It thus aims to be of practical use to companies for the maintenance and management of price lists. The scalability and flexibility of the models presented in this paper are also an important point: for the sake of simplicity, we have applied the models to three different products, but we stress that the same models and algorithms can easily be applied to any product.
Time series forecasting has always been a major topic in data science, with plenty of applications. For a general review of some of the most used tools, see for example [1]. Well-known traditional econometric methods are not always appropriate for studying and forecasting big and noisy time series data. This has generated particular interest in machine learning methods, bolstering data-driven approaches that comprise a wide range of methods and have the advantage of not relying on prior assumptions and knowledge about the data. See for example [2]-[5] for reviews focusing on the application of deep learning [6] to time series. Long short-term memory (LSTM) networks [7] and convolutional neural networks (CNNs) [8] are almost ubiquitous in time series forecasting with machine learning. CNNs are even more commonly used for image recognition and feature extraction. However, the forecasting accuracy of standalone CNNs can be relatively low [9].
The literature on economic time series prediction employing various methods -from classical to artificial intelligence ones -is very rich. Nevertheless, although we believe automatic updating mechanisms and the forecasting of sale prices are of utmost relevance, the literature on these topics is not as developed as one would expect. Most studies focus primarily on the implementation of models for the analysis and forecasting of general price levels (inflation) or commodity and stock market prices. The forecasting of food prices in China was considered by the authors of [10], [11]. In particular, Zou et al. [11] compared the performances of ARIMA, neural networks (NNs) and a combination of the two to forecast wheat prices in the Chinese market. Their findings showed that, overall, NNs perform best at the task. Neural networks were also employed in [12] to forecast monthly wholesale prices of two agricultural products. Ahumada and Cornejo [13] considered a similar problem, also taking into account possible cross-dependencies of different product prices. In [14] the author focused on sales forecasting using machine learning models, a topic similar to the one considered in the present paper. For more recent work on forecasting commodity prices see [15], where the authors forecasted gold prices, and [16], where the Levenberg-Marquardt Backpropagation (LM-BP) algorithm was applied to stock price prediction. Other authors used machine learning methods for inflation forecasting [17], [18], also in comparison with more classical econometric models [19]. Xue et al. [20] recently presented a high-precision short-term forecasting model for financial market time series employing deep LSTM neural networks, comparing them with other NN models. Their results showed that LSTM deep neural networks have high forecasting accuracy for stock market time series. In 2020, Kamalov [21] evaluated multilayer perceptrons, CNNs and LSTM neural networks to forecast significant changes in stock prices for four major US public companies, showing that these three methods yield better results when compared to similar studies that forecast the direction of price change. For models similar to the ones considered in this work, again applied to stock index forecasting, see [22]. Hybrid ARIMA/neural network models were instead studied by the authors of [23]. Stock prices have also been forecasted using LSTMs in conjunction with the attention mechanism [24]. Machine learning models using LSTMs and CNNs are in widespread use in time series forecasting, well beyond the financial and economic realm. For recent work on time series forecasting using machine learning outside the economic and financial area, see [25] for an application to forecasting the spread of COVID-19, and [26] for an application of deep learning to influenza prevalence forecasting.
In this paper we compare the performance of standard Autoregressive Integrated Moving Average (ARIMA) models [27], which we take as a benchmark, to Prophet -a forecasting tool developed by Facebook and based on a Generalized Additive Model (GAM) [28] -and to machine learning models exploiting LSTMs, both on their own and in combination with CNNs. ARIMA univariate models are considered a standard reference model in econometrics. The compared models are rather different, as are the datasets that they accept as input, making the comparison interesting. On one hand, Prophet's driving principles are simplicity and scalability; it is specifically tailored to business forecasting problems and by construction handles missing data very well. On the other, the NN models we construct allow for a multivariate regression, fully exploiting all the collected data, but also require some data pre-processing, as does ARIMA. Prophet has previously been compared to ARIMA models for the prediction of stock prices [29] and Bitcoin prices [30].
Our results indicate that the combination of CNNs and LSTMs yields the most accurate results for all three products, but requires the longest and most computationally expensive tuning. On the contrary, Prophet's performance was not brilliant, but model tuning and data preparation were particularly quick. ARIMA and LSTM-only neural networks showed good performance both in terms of accuracy and of time required for model selection and training.
The rest of the paper proceeds as follows. Section 2 introduces the dataset features, discussing its properties and some pre-processing steps that were taken on it; it also briefly presents the three models under consideration, their set-up and tuning. In Section 3 we give the results obtained with the three approaches and compare them. We conclude in Section 4 by discussing the results of this study and providing an outlook on future perspectives in the light of the paper findings.

Dataset description and preparation
For this study we had access to a dataset comprising a total of approximately 260,000 food order records, reporting the following information: date of order, order number, unit price, article code, sold quantity, customer code, offer (if present) and offer type, unitary cost. The records were collected by the wholesaler in a period ranging from year 2013 to 2021. For the study conducted in this paper, we decided to focus on the three products with the most records, namely Carnaroli rice 1kg × 10 (henceforth product 1), Gorgonzola cheese 1/8 of wheel 1.5 kg (product 2) and Cured aged ham 6.5 kg (product 3). The forecasting task considered in this work was to predict the average selling price for the following week, for each of the selected products.
First of all, we chose to leave out all data following the outbreak of the COVID-19 pandemic. This was motivated by the huge impact that the lockdowns and restrictions imposed by the authorities had on the food and catering sector, which introduced a major shock in sales trends at all scales. Therefore, we excluded all records dated later than March 9, 2020 (the last day before the first national lockdown in Italy).
A preliminary data analysis revealed that the dataset contained a sizable number of outliers; for some of them, it was evident that they were due to the incorrect typing of the product sale price.
To improve the quality of the dataset, we calculated the z-score of each record based on its price as $z = (p - \bar{p}^{(w)}) / \sigma^{(w)}$, where $p$ is the unit sale price and $\bar{p}^{(w)}$ and $\sigma^{(w)}$ are the mean and standard deviation of the price for the selected product, weighted by the quantity sold in each order. Then, we filtered out all records with $|z| > 4$. Figure 1 shows the price distribution for product 2 after the filtering.
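A minimal Python sketch of this filtering step follows; the DataFrame and its column names ('price', 'quantity') are hypothetical stand-ins for a single product's order records:

```python
import numpy as np
import pandas as pd

def filter_outliers(df: pd.DataFrame, z_max: float = 4.0) -> pd.DataFrame:
    """Drop records whose quantity-weighted price z-score exceeds z_max."""
    p, w = df["price"], df["quantity"]
    p_mean = np.average(p, weights=w)                           # weighted mean price
    p_std = np.sqrt(np.average((p - p_mean) ** 2, weights=w))   # weighted std. dev.
    z = (p - p_mean) / p_std
    return df[np.abs(z) <= z_max]
```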
In view of the subsequent time series forecasting, and as a further step in dealing with data inaccuracies, we decided to resample the dataset with weekly frequency. This was done by cumulating the number of orders in each window and calculating the average sale price for each week. For later use in neural network models, when resampling data we also kept track of the following fields in the dataset: number of served customers, number of orders, number of orders on sale, (weighted) average product cost, and (weighted) price standard deviation. Table 1 summarizes the main features of the resampled price time series for each of the products. In Figures 2, 3 and 4 we display the time series of sale prices and sold quantities after resampling. All prices, here and everywhere in the paper, are intended in euros (€). We then split the weekly dataset into a training set (data up to the end of 2017), a validation set (year 2018) and a test set (from 2019 to March 2020). This choice was made in order to make the training set as large as possible, while also having validation and test sets that cover at least one year of data. As can be seen in Figures 2, 3 and 4, even after resampling the price time series have missing periods. Therefore, they are not suited to being given as input to ARIMA and neural network forecasting models, as these require evenly spaced data. To overcome this problem, we adopted the following strategy for all products:
1. Since the time series for all products have a long window with no data in the second half of year 2013, we do not consider this empty period and start right after it;
2. When occasional weeks with no data occur, we interpolate by taking the average of the preceding and following week prices.
In this way we were able to fill in all empty weeks -in fact, after resampling, the missing records were very sparse. Note that the above procedure is only necessary to prepare the dataset for the ARIMA and NN models, as Prophet has no problems handling missing datapoints. The size of the datasets for each product, both before and after the removal of empty periods, is summarized in Table 2.
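The resampling and gap-filling steps can be sketched as follows (column names are again hypothetical; an isolated empty week is filled by linear interpolation, which coincides with the average of its two neighbours):

```python
def to_weekly(df: pd.DataFrame) -> pd.DataFrame:
    """Resample a single product's order records to weekly frequency."""
    df = df.set_index("date").sort_index()
    grp = df.resample("W")
    weekly = pd.DataFrame({
        "n_orders": grp["price"].count(),
        "quantity": grp["quantity"].sum(),
    })
    # quantity-weighted average weekly price (NaN for weeks with no orders)
    weekly["price"] = ((df["price"] * df["quantity"]).resample("W").sum()
                       / weekly["quantity"])
    # fill occasional single empty weeks with the mean of the neighbours
    weekly["price"] = weekly["price"].interpolate(method="linear", limit=1)
    return weekly
```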

ARIMA models
ARIMA models [27] are among the simplest and most widely used econometric approaches to univariate time series modeling. In this work, we implemented non-seasonal ARIMA models, neglecting the modulation effects of holidays and therefore using pure trend lines. In econometrics, it is quite customary when dealing with price variables to transform prices through a logarithmic map, since this generally leads to better results. We decided to follow this approach in the ARIMA modelling, thus working with $\log(p_t)$ in the model fitting. As a first step we checked the stationarity properties of the time series. We performed the Augmented Dickey-Fuller unit root test using the built-in method in the statsmodels Python package. The results we obtained are qualitatively similar for all three products we considered: for the $\log(p_t)$ time series one cannot reject the null hypothesis of the presence of a unit root, signalling the non-stationarity of the series. First differencing the series, i.e. considering $\Delta \log(p_t) = \log(p_t) - \log(p_{t-1})$, makes it stationary. Thus the $\log(p_t)$ series are integrated of order one, and accordingly the models we considered are ARIMA(p, 1, q).
In order to have a rough indication of the AR orders $p$ and the MA orders $q$, we computed the sample autocorrelation function (ACF) and the partial autocorrelation function (PACF) for $\Delta \log(p_t)$. Recall that:
• for an exact MA(q) process, the ACF is zero for lags larger than q;
• for an exact AR(p) process, the PACF is zero for lags larger than p.
As an example, we show the plots of these functions for product 2 in Figure 5. In the ARIMA model selection and fitting procedures, we used a different dataset splitting scheme from the one described above, in that the training set comprised years 2013-2018 (i.e. the union of the former training and validation sets). This is because we decided not to use the validation set to select the hyperparameters p and q, instead exploiting the Bayesian Information Criterion (BIC) as a metric for model comparison [31]. Hence we took into account different combinations of p and q around the values suggested by the ACF and PACF plots, and eventually selected the model with the lowest BIC.
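The stationarity check and the BIC-based order selection can be sketched with statsmodels as follows; the candidate (p, q) grid below is illustrative, not the grid actually used, and the weekly price series is assumed to be gap-filled as described in Section 2.1:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

log_p = np.log(weekly["price"])   # gap-filled weekly price series

# ADF test: a large p-value means a unit root cannot be rejected
print("ADF p-value, levels:     ", adfuller(log_p.dropna())[1])
print("ADF p-value, first diff.:", adfuller(log_p.diff().dropna())[1])

# BIC-based selection of the ARIMA(p, 1, q) orders
best_order, best_fit = None, None
for p in range(3):
    for q in range(3):
        res = ARIMA(log_p, order=(p, 1, q)).fit()
        if best_fit is None or res.bic < best_fit.bic:
            best_order, best_fit = (p, 1, q), res
print("selected order:", best_order, "with BIC:", best_fit.bic)
```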

Prophet
Prophet is an open-source tool provided by Facebook Inc., available both in Python and R. For the current analysis, the Python package (with Python 3.9) was used. As explained by the authors [32], the idea leading to Prophet was to develop a flexible forecasting tool which is easy both to use and to tune. The underlying model features a decomposable time series with three components: growth (or trend) $g(t)$, seasonality $s(t)$ and holidays $h(t)$ (if present). In the present case, there are no obvious holidays to consider, as the wholesaler's customers are mainly restaurants and hotels, which tend to stay open during holidays. The time series is therefore decomposed as
$$y(t) = g(t) + s(t) + \epsilon_t,$$
where $\epsilon_t$ encodes variations that are not taken into account by the model, and which are assumed to be normally distributed [32]. The Prophet model can be seen as a GAM [28]. In this framework, forecasting is phrased as a curve-fitting task, with time as the only regressor, so the model is univariate.
The trend function adopted for the problem under study is a piecewise linear function written as
$$g(t) = \left(k + \mathbf{a}(t)^T \boldsymbol{\delta}\right) t + \left(m + \mathbf{a}(t)^T \boldsymbol{\gamma}\right),$$
where $k$ is a scalar coefficient, $m$ is an offset parameter, $\mathbf{a}(t)$ is the indicator vector with $a_j(t) = 1$ if $t \geq s_j$ and $a_j(t) = 0$ otherwise, the $s_j$ are the trend changepoints -i.e. $S$ times $s_1, s_2, \ldots, s_S$ at which the angular coefficient of the trend is allowed to change -the $\delta_j$ are the rate adjustments, and $\gamma_j = -s_j \delta_j$ are parameters used to make the function continuous. The algorithm starts with a number $S = 25$ of potential changepoints, placed in the first 80% of the time series in order to avoid responding to fluctuations in the last part of the series. Then, the actual changepoints are selected by putting a sparse prior of the kind $\delta_j \sim \mathrm{Laplace}(0, \tau)$, with $\tau$ (a tunable hyperparameter) regulating the magnitude of the rate adjustments. A larger $\tau$ means the model has more power to fit trend changes. As for seasonality, Prophet accounts for it using Fourier series, namely
$$s(t) = \sum_{n=1}^{N} \left[ a_n \cos\!\left(\frac{2\pi n t}{P}\right) + b_n \sin\!\left(\frac{2\pi n t}{P}\right) \right],$$
where $P$ is the period of the seasonal effect. Since we considered weekly data with no other obvious expected seasonality effects but yearly ones, we had $P = 365.25$ days. The truncation parameter is set to $N = 10$ by the authors of [32] when modelling yearly seasonality, and we follow this specification. When performing the fit, a smoothing prior $\beta \sim N(0, \sigma^2)$ is imposed on the $2N$ components $\beta = (a_1, \ldots, a_N, b_1, \ldots, b_N)^T$, with $\sigma$ a second hyperparameter (essentially acting as an L2-regularization parameter). Prophet fits its GAM using the L-BFGS quasi-Newton optimization method of [33] in a Bayesian setting, finding a maximum a posteriori estimate.
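In code, this model specification reduces to a few lines. The sketch below assumes a training DataFrame in Prophet's required format (a 'ds' date column and a 'y' target column); the two prior-scale values are illustrative placeholders, not the tuned ones:

```python
from prophet import Prophet  # on older installs: from fbprophet import Prophet

m = Prophet(
    growth="linear",
    yearly_seasonality=10,          # Fourier truncation N = 10
    weekly_seasonality=False,       # weekly data: no intra-week seasonality
    daily_seasonality=False,
    changepoint_prior_scale=0.05,   # tau (illustrative value)
    seasonality_prior_scale=0.1,    # sigma (illustrative value)
)
m.fit(train_df)
future = m.make_future_dataframe(periods=1, freq="W")
forecast = m.predict(future)        # 'yhat' holds the one-week-ahead price
```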

Neural Networks
The advent of artificial intelligence, in particular machine learning, has led to the development of a set of techniques that have proved to be very useful in many different areas. One breakthrough has certainly been deep learning [6], which has revolutionized our way of handling and exploiting information contained in data. Deep learning can effectively detect and model hidden complexity in data, automatically extracting features that should otherwise be extracted manually by dataset inspection.
A standard choice when facing problems involving time series is that of using LSTM neural networks, a kind of recurrent neural network (RNN) devised by Hochreiter and Schmidhuber in 1997 [7]. Like all RNNs, they can by construction handle data endowed with temporal structure, while also providing a way to deal with the vanishing gradient problem [34]. Here we describe the application of LSTM NNs to the problem under study, both on their own and in combination with CNNs. Indeed, standard LSTM NNs for time series forecasting can be enriched with one-dimensional convolutional layers that sequentially apply a one-dimensional filter to the time series. Convolutions can be seen as non-linear transformations of the time series data. This enhances the model's capability to learn discriminative features which are useful for forecasting and can be fed to the LSTM layers that follow. The models developed in this work were trained and tested using Python 3.9 and TensorFlow 2.5.
Unlike in the ARIMA and Prophet cases, with NNs we can exploit a larger fraction of the information available in the dataset by setting up a multivariate regression. However, since dates cannot be used as input variables to a NN, we made an addition to the fields listed in Section 2.1, performing a time embedding to provide information about seasonality. We did this by adding the columns
$$\text{week\_cos} = \cos(2\pi w / 52.1429), \qquad \text{week\_sin} = \sin(2\pi w / 52.1429),$$
where $w$ is the week number ($w = 0, 1, \ldots, 52$). Therefore, we had a total of 9 input columns that were passed to the NN models. A sample of the input dataset for one product is shown in Table 3.
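A sketch of this embedding, assuming the weekly DataFrame built earlier (with a date index; column names are hypothetical):

```python
import numpy as np

# 52.1429 ≈ 365.25 / 7 is the average number of weeks per year
w = weekly.index.isocalendar().week.astype(float)
weekly["week_cos"] = np.cos(2 * np.pi * w / 52.1429)
weekly["week_sin"] = np.sin(2 * np.pi * w / 52.1429)
```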
To construct the actual training dataset (made of a tensor $x_t$ and a scalar $y_t$ for each time $t$) that could be fed to LSTM neural networks, we then performed the following steps:
• reshape the data so that at each time $t$, $x_t$ is an $n \times 9$ tensor containing the $n$ last values of each time series in Table 3;
• set $y_t = \Delta_{t+1} = p_{t+1} - p_t$ as the variable to be used in the cost function.
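These two steps amount to a standard sliding-window construction, sketched below for a (T × 9) array of standardized features in which one column holds the price:

```python
import numpy as np

def make_windows(values: np.ndarray, price_col: int, n: int):
    """Return X of shape (T', n, 9) and y of shape (T',), where
    x_t stacks the last n rows up to time t and y_t = p_{t+1} - p_t."""
    X, y = [], []
    for t in range(n - 1, len(values) - 1):
        X.append(values[t - n + 1 : t + 1])                         # n x 9 window
        y.append(values[t + 1, price_col] - values[t, price_col])   # Delta_{t+1}
    return np.array(X), np.array(y)
```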
In this way, the model learns to predict $y$ (the price variation) at each time based on information about the last $n$ timesteps. Predicting the increment of the quantity of interest instead of the quantity itself is a well-known way to improve performance when training multivariate machine learning models. Moreover, we checked through the Augmented Dickey-Fuller test that the $\Delta_t$ time series is stationary. The number $n$ of timesteps used depends on the model and will be specified later. The NN models tried in this paper for predicting product prices fall into two classes: those using only LSTM layers and those with CNN layers before the LSTMs. We denote these classes A and B, respectively.

Results
In this section we report results obtained with the three different approaches to forecasting -namely ARIMA, Prophet and deep learning -studied in this work.

ARIMA results
We first considered ARIMA models, to provide a standard performance benchmark against which to compare the other models developed in the rest of this work.
The best ARIMA models for the three products are given in Table 4. As outlined in Section 2.2, they were selected by considering the series of the price logarithms $\log(p_t)$ and using a minimum-BIC criterion [31]. For the sake of comparison with the other models, we transformed the $\log(p_t)$ series back to the $p_t$ series to compute the root mean squared error (RMSE) between the predicted and observed increments, $\hat{\Delta}_t = \hat{p}_t - p_{t-1}$, $\hat{p}_t$ being the predicted price at time $t$. In all cases, we also checked through the Ljung-Box statistics [35] that the model residuals show no significant autocorrelation. The results obtained on the test set by the selected ARIMA models are summarized in Table 5. We also report values for the MAE (mean absolute error) and MAPE (mean absolute percentage error).
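For reference, the metrics can be computed as in the following sketch. The RMSE is taken on the increments as described above; for the MAE and MAPE we assume price levels are used, since a percentage error on near-zero increments would be ill-defined:

```python
import numpy as np

def increment_metrics(p_obs, p_hat):
    """RMSE on weekly increments, plus MAE/MAPE on price levels (assumed)."""
    p_obs, p_hat = np.asarray(p_obs), np.asarray(p_hat)
    d_obs = np.diff(p_obs)             # Delta_t     = p_t     - p_{t-1}
    d_hat = p_hat[1:] - p_obs[:-1]     # hat Delta_t = hat p_t - p_{t-1}
    rmse = np.sqrt(np.mean((d_hat - d_obs) ** 2))
    mae = np.mean(np.abs(p_hat[1:] - p_obs[1:]))
    mape = 100.0 * np.mean(np.abs(p_hat[1:] - p_obs[1:]) / p_obs[1:])
    return rmse, mae, mape
```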

Prophet grid search
As suggested in the Prophet documentation and reviewed in Section 2.3, one can tune the τ (trend changepoint prior scale) and σ (seasonality prior scale) hyperparameters so that the model fits the data as well as possible. We did so by performing a grid search over τ and σ in the following way: for each combination of τ ∈ {0.005, 0.01, 0.05, 0.1, 0.5} and σ ∈ {0.01, 0.05, 0.1, 0.5, 1, 2}, we started by fitting the model over the training dataset, and predicted the price for the following week (the first datapoint in the validation set). We calculated the squared error between predicted and observed price. Then, we moved on to the second datapoint in the validation set, performed a new fit using also the first validation set datapoint, and predicted the price for the following week. The whole process was repeated until the validation dataset was exhausted. For each product, the configuration with the lowest RMSE was selected, yielding the results shown in Table 6.
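A sketch of this rolling one-step-ahead grid search, under the same assumptions on the data format as in the Prophet sketch of Section 2.3 (DataFrames with 'ds' and 'y' columns, here named train and valid):

```python
import itertools
import numpy as np
import pandas as pd
from prophet import Prophet

def rolling_rmse(train_df, valid_df, tau, sigma):
    """One-step-ahead RMSE over the validation set, refitting at each step."""
    sq_errs = []
    for i in range(len(valid_df)):
        m = Prophet(changepoint_prior_scale=tau, seasonality_prior_scale=sigma,
                    yearly_seasonality=10, weekly_seasonality=False,
                    daily_seasonality=False)
        m.fit(pd.concat([train_df, valid_df.iloc[:i]]))
        future = m.make_future_dataframe(periods=1, freq="W")
        y_hat = m.predict(future)["yhat"].iloc[-1]
        sq_errs.append((y_hat - valid_df["y"].iloc[i]) ** 2)
    return np.sqrt(np.mean(sq_errs))

grid = itertools.product([0.005, 0.01, 0.05, 0.1, 0.5],   # tau values
                         [0.01, 0.05, 0.1, 0.5, 1, 2])    # sigma values
tau_best, sigma_best = min(grid, key=lambda ts: rolling_rmse(train, valid, *ts))
```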

Prophet forecasting
After selecting the best values of the hyperparameters for each product, we employed them to specify the Prophet model in the test phase. Testing took place in the following way: we started by fitting the model over the entire training plus validation dataset, predicting the first data entry in the test dataset and calculating the squared error. We repeated the procedure for all entries in the test dataset, each time using all the previous history, and calculated the achieved RMSE at the end of the process. Note that, with this procedure, the test dataset too is progressively used to fit the model. We plot in Figure 6 the result of the fit over the entire dataset for product 1, i.e. the function that would be used to predict the unit price for the first week following the end of the test dataset. Figure 7 shows instead the plot of the predicted and observed increments -$\hat{\Delta}_t$ and $\Delta_t$ as defined in Eq. 5 -in the case of product 2. The performance of Prophet in forecasting the weekly price time series is summarized in Table 7. As done for the ARIMA models, besides the RMSE metric used in the fine tuning, we also report values for the MAE and MAPE.

NN grid search
As a third forecasting tool, we studied deep neural networks. Two different classes of models -class A and class B, as introduced in Section 2.4 -were analyzed. We performed two different grid searches to select the best model in each class, as we now briefly describe. The features common to the two classes were: the usage of an MSE cost function and of the Adam optimization algorithm [37], with learning rate (or stepsize) α ∈ {0.0005, 0.001}; the adoption of an early stopping procedure, monitoring the cost function on the validation set with a patience of 5 epochs, while also setting an upper bound of 150 training epochs. Finally, data were standardized using a z-score normalization, and we used a batch size of 32.
For class A, we trained NN models with the following architecture and hyperparameters: a number $l \in \{1, 2, 3\}$ of LSTM layers with $n_u \in \{32, 64, 96\}$ neurons each and normal Glorot weight initialization [38]. Each LSTM layer was followed by a dropout layer, with dropout rate $r \in \{0.1, 0.2, 0.3\}$. The output layer consisted of a single neuron with linear activation function, again with normal Glorot initialization. For this class of models, we used a number of timesteps $n = 4$. The grid search over the hyperparameters $l, n_u, r, \alpha$ was performed by initializing and training each model ten times, monitoring the cost function on the validation set and recording the best result obtained for every configuration. The best performing models for each product are reported in Table 8.

The second class of models, class B, consisted of a combination of CNN layers and LSTM layers, as done for example in [15] and [22]. We added two one-dimensional convolutional layers, with a pooling layer in between. Each of the convolutional layers had $f \in \{10, 20, 30\}$ output filters, kernel size $k_s \in \{2, 4\}$ and ReLU (rectified linear unit) activation function. Moreover, we tried same, causal or no padding in each of the conv1D layers: we dub the corresponding hyperparameters $pad_1$ and $pad_2$. The 1D average pooling layer had pool size equal to 2 and no padding. This block was then followed by the same LSTM layers as for the first class of models. The reason for adding CNN layers is that they can help make better use of the data history, improving the ability of LSTM networks to learn from long series of past data. The grid search was indeed performed with varying numbers of timesteps $n \in \{4, 8, 12\}$. As done for the previous class, each model was initialized and trained multiple times, in this case twice. Results for the grid search over the hyperparameters $l, n_u, r, \alpha, f, k_s, n$ are shown in Table 9. Figure 8 shows the trajectory of the training and validation cost functions for the best-performing model in the case of product 1. Note that the cost functions are calculated on rescaled data, therefore they cannot be directly compared to the values appearing on the first line of Table 9.
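The two model classes can be sketched in TensorFlow/Keras as follows. Hyperparameter values are placeholders to be set by the grid search, the padding choice is fixed to "same" for brevity, and the windowed arrays (X_train, y_train, X_val, y_val) are assumed to come from the construction of Section 2.4:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_steps, l=2, n_u=64, r=0.2,
                use_cnn=False, f=20, k_s=4, alpha=0.001):
    """Class A (LSTM-only) or class B (CNN + LSTM) forecasting model."""
    inp = tf.keras.Input(shape=(n_steps, 9))   # n timesteps x 9 features
    x = inp
    if use_cnn:  # class B: two conv1D layers with average pooling in between
        x = layers.Conv1D(f, k_s, padding="same", activation="relu")(x)
        x = layers.AveragePooling1D(pool_size=2)(x)
        x = layers.Conv1D(f, k_s, padding="same", activation="relu")(x)
    for i in range(l):  # stacked LSTM layers, each followed by dropout
        x = layers.LSTM(n_u, kernel_initializer="glorot_normal",
                        return_sequences=(i < l - 1))(x)
        x = layers.Dropout(r)(x)
    out = layers.Dense(1, kernel_initializer="glorot_normal")(x)  # linear output
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=alpha),
                  loss="mse")
    return model

# training with early stopping, as in the grid search
early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
model = build_model(n_steps=4)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=150, batch_size=32, callbacks=[early])
```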

NN forecasting
We then used the models specified by the hyperparameter choices in Tables 8 and 9 to forecast price variations on the test set. We proceeded by training each selected model on the entire training and validation set, for a number of epochs corresponding to the epoch at which training had stopped during the grid search. Table 10 reports the test set performances of the NN models obtained in this way. Although they played no role in model training and validation, here we also report the values of the MAE and MAPE metrics obtained on the test set. We can already observe that class B models outperform class A models in forecasting prices for all three products. Figure 9 displays the forecasts made by the best NN model and compares them to the actual price variations, in the case of product 3.

Result comparison
We can now compare the results obtained with the various models that have been tried, by looking at Tables 5, 7 and 10. We begin by noting that Prophet's performance is considerably poorer than that of the ARIMA benchmark and of the neural networks. On one hand, this could be expected, as Prophet is based on a model with relatively low complexity that prioritizes ease of use and of tuning. On the other, as mentioned in Section 3.2.2, Prophet was progressively fitted also on the test set, effectively using more data than the other models for the fit, so its performance has arguably benefited from that. Nevertheless, one point that needs to be stressed is that we were able to provide Prophet with the entire weekly price time series, comprising all of the 2013 data, with no need for particular data pre-processing. Therefore, although Prophet's performance was not brilliant for the problem studied in this paper, it could still be useful in contexts where quick, preliminary forecasts are needed.
Turning to the other models, we observe that those featuring both CNN and LSTM layers (class B) yielded the most accurate forecasts for all three products. At the same time, tuning them required by far the longest time (approximately 20 hours for each product using an NVIDIA RTX 3060 graphics card) with moderate improvement over purely LSTM (class A) models: considering the RMSE metric, the improvement amounts to about 9% for both products 2 and 3, while for product 1 the two classes yield similar results (class B has lower RMSE but higher MAE and MAPE). On the other hand, class A models required a considerably lower computational effort (of the order of 30 minutes to select the model hyperparameters for each product). Notice that once the grid search was concluded, training the best model on the entire training plus validation dataset required only a few minutes, for both class A and class B models.
ARIMA models performed well if we take into account both the achieved values of the metrics and the time necessary for tuning. They were less accurate than class B models, yielding RMSEs that were 23% higher for product 1, 7% higher for product 2, and 8% higher for product 3. However, they required about the same tuning time as class A models, and performed better than them for products 2 and 3: compared to class A models, the RMSE obtained by ARIMA models was 23% higher for product 1, but 2% lower for products 2 and 3. We remark that ARIMA is univariate, while a multivariate input dataset was used to train and test the deep learning models: this highlights the effectiveness of the ARIMA approach for the problem under study, while at the same time suggesting that the additional data fields used to perform the multivariate analysis were not very informative for price prediction. We add that the machine learning models were tuned over a larger parameter space than the others: the search grids comprised 54 and 8978 hyperparameter configurations for class A and class B NN models respectively, versus a maximum of 5 ARIMA configurations and the 30 of Prophet.
Another important aspect to consider in comparing the models is that dataset size strongly affects the performance of machine learning models. In this regard, we note that the sizes of the datasets were not very large (just over 200 datapoints for each product, as seen in Table 2), hence one could expect the performance of the NN models, especially, to further improve as more historical data becomes available.

Discussion
In this paper, we have discussed the application of different methods to the forecasting of wholesale prices. We put a standard econometric model (ARIMA) side by side with two different approaches to time series forecasting. These were rather diverse both in kind and in complexity, ranging from a simple additive model using a piecewise linear trend and Fourier series seasonality (Prophet) to deep learning models featuring both convolutional and LSTM layers. The findings showed that while Prophet was quick to set up and tune, requiring no data pre-processing, it was not able to come close to the performance of the other, well-established time series forecasting models. Instead, we found that the best deep learning models performed better than ARIMA, but also required much longer times for hyperparameter tuning.
The work done in this paper can be extended in many directions. First, it would be interesting to carry out a similar analysis for the forecasting of sales, and to consider data with daily instead of weekly frequency. Sales forecasting at higher frequency can indeed be relevant for wholesalers and retailers. Second, a more refined version of the study would distinguish between different customers, as the selling strategies adopted by the wholesaler certainly differ when dealing with customers of various kinds and sizes. Customer profiling is an extremely interesting and active avenue of applied research, which can clearly enhance companies' returns. Therefore, in the near future we plan to extend the analysis by combining customer classification algorithms with time series forecasting: understanding how and when each customer buys, how specific products are treated differently by different customers (identifying complementary and substitute goods for each of them), as well as relating the price elasticity of demand for different products/customers, are all aspects that could lead to economic benefits for the wholesaler. To this end, one would want to compare further machine learning models on dynamic panel data where prices for each product and customer, as well as the price elasticity of demand, are considered. This approach could lead to an essential gain in forecasting accuracy, and would be an important contribution to sales prediction analysis, which is becoming an increasingly important part of modern business intelligence [39], [40]. Another aspect of sure relevance would be to evaluate algorithms and models on data including the effects of the COVID-19 outbreak, both to learn how the market has transformed and to help modify selling strategies in adapting to the current rapidly changing situation.
The limits encountered in the application of the forecasting tools examined in this work encourage the evaluation of further models that could bring together the advantages of each approach. Finally, it would be relevant to apply one of the most exciting recent advances in machine learning, namely the attention mechanism [41], to the forecasting task we have considered. Work on this particular topic is definitely attractive and has recently made its appearance in this field [42].
Note: The paper is an output of the project "Cancelloni Big Data Analytics (BDA)" at Cancelloni Food Service S.p.A. Data used in this work are property of Cancelloni Food Service S.p.A. and cannot be disclosed.