Forecasting Key Retail Performance Indicators Using Interpretable Regression

Foot traffic, conversion rate, and total sales during a period of time may be considered to be important indicators of store performance. Forecasting them may allow for business managers plan stores operation in the near future in an efficient way. This work presents a regression method that is able to predict these three indicators based on previous data. The previous data includes values for the indicators in the recent past; therefore, it is a requirement to have gathered them in a suitable manner. The previous data also considers other values that are easily obtained, such as the day of the week and hour of the day of the indicators. The novelty of the approach that is presented here is that it provides a confidence interval for the predicted information and the importance of each parameter for the predicted output values, without additional processing or analysis. Real data gathered by Follow Up, a customer experience company, was used to test the proposed method. The method was tried for making predictions for up to one month in the future. The results of the experiments show that the proposed method has a comparable performance to the best methods proposed in the past that do not provide confidence intervals or parameter rankings. The method obtains RMSE of 0.0713 for foot traffic prediction, 0.0795 for conversion rate forecasting, and 0.0757 for sales prediction.


Introduction
There are three key indicators to which retail managers pay special attention: foot traffic, which is the number of visitors entering a brick-and-mortar store, conversion rate, which is the ratio of the people who actually make a purchase, and total sales, over a certain period [1]. Predicting the future store performance within a reasonable time frame allows managers to plan the required sales force, in order to provide the necessary stock for dealing with the expected number of customers and sales, and to plan the cash flow. Researchers have developed some models with this goal in the past using analytical as well as data mining-based models.
In the last few years, there has been a trend to pay more attention to models that can not only accurately predict a certain value, but they also give more information regarding the behavior of the model. One additional piece of desirable information that a model may convey is the importance or "weight" (positive or negative) the input variables have in the prediction process. Knowing this, managers can better react to unexpected changes of the environment conditions, such as an extraordinary holiday, or promotion campaigns.

Related Work
Time series forecasting methods can be applied to predict foot traffic, conversion rates, and sales. However, the result will depend on the pedestrian context, which can include variables, such as time of the day, period of the year, geographical location, etc. Previous foot traffic forecasting works have attempted to predict pedestrians' movements in closed areas or, alternatively, in open areas [12].
For closed areas, one approach is to analyze the possible routes of pedestrians, e.g., [13]. However, this approach requires complete knowledge of the physical environment and the results are probabilistic; they can be translated into a pedestrian forecast only if the incoming visitors number is provided. For open areas, a typically used approach is to place physical sensors on strategic places and analyze the time series data of each one.
Trying to provide performance metrics for retail business, we attempted to predict the transition of pedestrian incoming visitors from an open area to a closed area that is based on data provided by sensors placed at the entrance of each store. This data are combined with sales and other commercial data, such as promotions or discounts.
This research is focused on supervised learning and interpretability of the algorithm, which is the possibility to track the decisions made by machine [14]. Supervised learning [15] is a machine learning technique that trains an algorithm or statistical model through examples. Nowadays, this is one of the most common techniques used in machine learning, since it provides maximum flexibility and enables computer programs find patterns from the given data [16].
In order to organize this section, and despite their relationship, supervised learning and interpretability related works are presented separately.

Supervised Learning in Retail
Supervised learning models can be grouped in classification and regression models, depending on the type of the target variables Y. Y are discrete variables in classification models; by contrast, Y are continuous variables in regression models. In our case, Y are continuous variables that are represented as quantities and time series, so we are focusing on regression models for store foot traffic, conversion rates, and sales forecasts.
In particular, forecasting foot traffic and sales can be considered time series forecasting problems. A time series is a set of observations, where each one is recorded at a specific time [17]. Time series forecasting algorithms have been applied in many areas, e.g., financial market prediction, weather forecasting, and machine prognosis. Retail is one of those areas [6,9,10,18].
Foot traffic and sales forecasting are of significant interest for retailers. For example, Lam et al. [6] highlight the importance of store traffic forecasting and propose a model for the optimization of store labor schedules based on expected traffic. Their model sets store sales potential as a function of store traffic volume, customer type, and customer response to sales force availability. This model first uses the log-transformation of the traffic variable to induce a constant variance, and then a difference with one week lag is applied to this transformed variable. Finally, autoregressive integrated moving average (ARIMA) components are used to adjust for auto-correlations in the residuals. This model was used for a two week ahead forecast and obtained a percentage-forecast error of about 30% on average.
Sales and conversion tickets are commonly modeled as continuous variables and predicted while using regression models, e.g., Yu et al. [8] implemented a Support Vector Regression (SVR) to forecast newspaper/magazine sales. They found out that demographic characteristics, such as gender, age, income, education, and occupation distributions, affect newspaper/magazine sales. However, the relevance of these characteristics is different for other kind of products. Arunraj and Ahrens [7] focused on predicting daily food sales; they used a seasonal autoregressive integrated moving average with external variables (SARIMAX) model, and they concluded that sales forecasts were influenced by external variables, such as day-of-the-week seasonality, month-of-the year seasonality, holidays, festivals, price reductions, and weather. Variations of influence variables among different kinds of products implied that the models should be trained with data dis-aggregated by product or store.
ARIMA is the most popular model to predict foot traffic, showing its effectiveness in many use cases [6,[19][20][21]. However, it lacks the modeling of non-linear relationships. Even though it does not perform well in some problems, the ARIMA model is typically used as a baseline model to show the performance of other methods. Cortez et al. [9] used data that were collected in a period of six months from a facial recognition camera to detect the number of daily visitors along with their gender in a sports store. They compared six approaches to predict daily store traffic, using ARIMA as one of their baseline methods, in three different cases. They only detected male visitors, only female and all visitors, forecasting up to a week ahead. Their input vector included the time of the event, special daily event (weekend or holiday; major sports or entertainment event), and weather conditions, such as maximum wind speed, temperature, humidity, and rain. They proposed two metrics to test the performance of the models in this problem: the Average Normalized Mean Absolute Error (ANMAE) and a novel metric named the Average Estimated Store Benefit (AESB). In the AESB metric, the manager of the store executes a forecast for N days ahead. Subsequently, an alternative plan is set with better management (e.g., promotions to bring more traffic). Each daily error is multiplied by an average cost that is related to a missed opportunity. Overall Cortez et al. [9] showed that the Support Vector Regression (SVR) was the best forecasting method, regardless of the visitors' gender and only using time as its input. They used a two-stage approach to obtain optimal performance; first, they used a sensitivity analysis proposed in [22] to discard the least relevant time lag and then they undertook a grid search to find the optimal hyper-parameters.
The previously mentioned method, SVR, is another model that is widely used in time series prediction. SVR is a machine learning model that implements the structural risk minimization inductive principle to obtain a good generalization on a limited number of learning patterns [23]. With a training set L, the goal of the method is to predict the outcome valueŶ of every element of the training set that has, at most, deviation from the real target variable Y [24]. A prediction is calculated as: where w is a vector in the same dimensional space as X and b is a constant, and K is a kernel function. Two types of SVR are typically used. The first is linear SVR, which uses a SVR with a lineal kernel, where the kernel function is the dot product. The second is the non-linear SVR with a Gaussian Radial Basis function (RBF) kernel that is defined as: The predictions need to be as flat as possible. This means that w needs to be as small as possible. One way to ensure this is to minimize the norm of w. However, this optimization is sometimes not feasible or errors need to be allowed. Vapnik [25] suggested the formulation of the optimization problem where the constant C > 0 determines the trade-off between the flatness of the mapping function and the amount up to which the deviation larger than is tolerated.
K-Nearest Neighbors (KNN) Regression is another non-parametric time series forecasting model [26]. It uses the set of K observations that are closest to an input variable X to compute a predictionŶ. Specifically, the KNN model matches X with the average of its k-nearest neighbors in the training set L. The closeness of a point is evaluated by a distance metric || · ||. An improved version of this method is called Weighted Nearest Neighbors (WKNN), which strongly motivated the creation of the new regression method that is presented in this work. The method uses a weighted average, where the weight of each neighbor is proportional to its proximity [27,28]. It also uses a weighted distance to compute the nearest neighbors. The weight vector is learned during the training phase while using gradient descent (Section 3.2).
Abrishami et al. [10] used data from 56 stores of different business, such as gyms, coffee shops, restaurants, and bars. These data were collected using wireless access points and it was used to forecast the store traffic one week ahead using a Random Forest (RF) and SVR model. This work, unlike [9,18], discarded the use of weather data, because no correlation was observed between traffic data and weather; moreover, its forecasting inaccuracy only added error to the prediction. They used the time of the event as an input vector, including holiday status, special event status, and location (e.g. closeness to schools and being a tourist city). As performance metrics, the Root Mean Squared Error (RMSE) was used, which is the difference between the estimated values and the actual value, as shown in Equation (3).
where Y is the vector of the true target variables andŶ is the vector of its predicted values, both being vectors of size n. The effect of each difference is proportional to the size of the squared error; thus, larger errors have a disproportionately large effect on RMSE. Consequently, RMSE is sensitive to outliers. The Mean Absolute Error (MAE) was another used measure, which computes the average absolute difference between the predicted valuesŷ and the real value y, as shown in Equation (4).
whereŷ and y are vectors of size n. The Mean Absolute Percentage Error (MAPE) is used for comparison between different datasets. Across all different businesses, the best results were obtained by the SVM model. Afterwards, the authors deepened their study in another work where they tested the performance of Long short-term memory (LSTM). The LSTM model obtained better performance than their previous work. In this new study, the feature vector consisted of a sequence of length 8, which contained the daily foot traffic from the same day in the previous eight weeks. The forecast window used was one month, but, as they used the data of the previous week, it was not possible to use their same feature vector in a real world setting.
In this work, we also analyzed the implementation of a Gaussian Process (GP) [29], which is a collection of random variables, where a finite number has a joint Gaussian distribution. A GP is completely defined by a mean m(x) and co-variance function K(x, x ) (y ∼ GP(m, K)). Let us consider a training set L with X a set of inputs and Y the set of outputs, and a validation set with an input X * and output Y * . Because the key assumption in GP is that the data are represented by a Gaussian distribution, the data can be represented as: where µ and µ * are the mean values of the training and validation set. Σ, Σ * , and Σ * * are the covariance of the training, training-validation, and validation sets. Y is known, so the target is to compute the value for a new observation Y * given Y. In order to look for a flexible solution, we tested Random Forest [30], which is an ensemble method, a method that uses multiple learning algorithms to obtain the best performance. RF is a combination of random trees (RT), where each tree depends on the value of a random vector sampled independently and with the same distribution for all trees in the forest. The output is computed as the average of each tree output. Each tree is grown and not pruned based on any error measure. This means that the variance of each one of these individual trees is high. However, by averaging the results, this variance can be reduced without increasing the bias.
Finally, we evaluated the Recurrent Neural Network (RNN) [31], which is a type of deep neural network that is specifically designed for sequence modeling.

Interpretable Regressions
The problem that we had testing the algorithms was, in some cases, the performance, but most meaningful to our context, was the interpretability of the results.
In this work, we will understand interpretability as the ability of a model to explain or present its results in understandable terms to a human being [32]. The interpretability depends on the method transparency, which connotes some sense of understanding of the inner mechanisms. Lipton [33] considered transparency at three levels. First at the level of the entire method (simulatability), at the second level individual components, such as parameters (decomposability), and, lastly, at the third level of the training algorithm (algorithmic transparency).
From previous research [5,34,35], we realized that the Dempster-Shafer theory [4] provides interpretability at the third level, as previously indicated. The Dempster-Shafer theory, which is also known as the theory of belief function, is defined, as follows: Let X be the set of all possible states of a system called frame of discernment. A mass assignment function m is a function satisfying: where 2 X is the combination of all possible states, φ is the null state, and A is each one of the states. The term m(A) can be interpreted as the probability of getting precisely the outcome A, and not a subset of A. Multiple evidence sources that are expressed by their mass assignment functions of the same frame of discernment can be combined using the Dempster Rule (DR). Given two mass assignment functions m 1 and m 2 , a new mass assignment function m c can be constructed by the combination of the other two while using the following formula: where K is a constant representing the degree of conflict between m 1 and m 2 and it is given by the following expression: Petit-Renaud and Denoeux [3] introduced a regression analysis algorithm that is based on a fuzzy extension of belief functions, called Evidential Regression (EVREG). Given an input vector X, they predict a target variable y in the form of a collection of evidence associated with a mass of belief. This evidence can be fuzzy sets, numbers, or intervals, which are obtained from a training set that is based on a discount function that takes their distance to the input vector X and it is pooled using the Dempster combination rule (Equation (7)). They showed that their methods work better than similar standard regression techniques, such as the k-Nearest Neighbors using the data of a simulated impact of a motorcycle with an obstacle.
In our case, the interpretability will be achieved by obtaining the weights for each component of the input. Achieving interpretability in this way allows for us to identify the most relevant variables for forecasting a certain value. This fact lets a retail store manager better react to sudden unexpected changes of these variables. For example, if the stores have a high weight value for holiday existence, then the sudden appearance of a day "like" a holiday will suddenly affect the performance of those stores.

Materials and Methods
The goal of this research is to predict the key performance indicators (foot traffic, conversion rate, and store sales). The problem of predicting each one of these retail store indicators can be modelled as a discrete time series forecasting problem. This problem has a set of observations x t made at fixed time intervals (e.g., a day), being recorded at a specific time t as input [17]. Accordingly, given a time series X = {x 0 , x 1 , . . . , x t }, the time seriesX = {x t+1 , . . . , x n } with n > t needs to be predicted, with [t + 1, n] being the forecast window.
The methodology for this research work can be summarized, as follows ( Figure 1): first, with the participation of experts from the Follow Up company. we identified the information needs that consisted not only of getting a reliable predictor for the three key store performance indicators for the next month, but also obtaining information regarding the main parameters that determine these expected numbers. A month can span from four to six weeks; in this work, a month will span exactly four-weeks, so the forecasting window will be 28 days. Previous works have found that small time windows (e.g., a week) are often applied in real-world settings [9,10,18]; however, store managers need a longer time window than a week in order to carry out a suitable business plan based on expected number of visitors and sales. The next step was to analyze the available data to determine its quality and reliability, and to identify the need for filtering invalid entries. Subsequently, by studying the state of the art of the regression models, we found out that the most accurate methods were the so called black-box models that do not convey more information than the predicted value. This motivated us to develop a new model that is based on an existing one, called EVREG, in order to improve its accuracy and interpretability. Having the data and model, we proceeded with the definition of the embedding vector, after which we could make the first prediction testings. Their results were analyzed in order to adjust/change the data filtering as well as the model and the data embedding until reaching an acceptable result. The next subsections describe, in more detail, the data acquisition and validation, the development of the regression model, and the data embedding.

Data Acquisition and Validation
The data registered by Follow Up during each hour of operation are the following: the number of visitors (foot traffic), number of tickets, and the total amount of sales (in Chilean currency units) of retail stores. Data on foot traffic are recorded using a three-dimensional (3D) stereoscopic camera that is programmed to recognize, follow, and count people in an enclosed area with the purpose of getting exact knowledge of how crowded the physical space is and the flow of people in it. The cameras have an accuracy of approximately 98% and it can be used both in and outdoors. The ones used specifically for this work were installed in closed retail stores. They are placed at the store entrances and strategic spots in order to capture as much visual information as possible. The way that they detect humans from other static or moving objects is by using video analysis software focused on identifying humans easily in a dynamic environment; this is achieved by using stereoscopic cameras that can make a 3D map of the place and objects moving in it. Once a camera detects people, it starts counting and following them. These cameras are all connected to a private network that allows communication among them and not get 'confused' when simultaneously detecting the same person. Data on the number of sales (tickets) and the total amount of sales are obtained directly from the store records. We accessed this data from an Application Programming Interface (API) supplied by Follow Up. The API provides the information on an hourly basis. The obtained data are detailed in Table 1. It should be noted that data in this field are usually proprietary and a valuable asset for the companies. This is the case with Follow Up, which cannot disclose the raw data.
The data are checked for reliability, in particular, a store for which we want to predict the key performance indicators should not have periods where data are missing. Only stores with more than four years of history are selected. Most stores available in the API did not comply with these requirements, so only 20 stores were selected for our experiments, because these ones had the longest history with no null or abnormal values. The extraction window used for the data will be four years, from August 2015 to the same month of 2019. When the stores are selected, the raw data are enriched to produce the input vector used by the method. The number of tickets in the last hour (number of sales).

Integer sales
The total sales in the last hour. Integer Figure 2 shows an example series of store visitors by hour from August 2015 to August 2019.

Weighted Evidence Regression Model
In this work, we chose to use the Weighted Evidential Regression model (WEVREG) presented by Panay et al. [5]. WEVREG is an interpretable regression method where the prediction of a new observation can be easily followed. This model is based on a previous one that was proposed by Petit-Renaud and Denoeux [3], which uses the Dempster-Shafer Theory together with a variation of K-Nearest Neighbors (KNN) to produce an evidence-based regression model. WEVREG extends this model by adding weights to each dimension of the attributes making certain parameters have more importance than others when deciding the most similar instances of the KNN process. In addition, this change allows measuring the importance of each attribute in the prediction.
WEVREG uses a distance and a discount function on a new observation to obtain the more similar vectors in a training set, and then it computes the expected value of the observation. Mathematically, the distance function d is a weighted euclidean distance that is defined as: where w is the weight vector of size k, and x ik and x jk are the values of vectors x i and x j at dimension k. WEVREG requires that the similarity between vectors be bounded in the [0, 1] interval. For this reason, a discount function is applied to the distance function presented. The discount function φ between vectors x i and x j using this distance measure will be defined as a Radial Basis Function (RBF): where γ is the radius of the function; intuitively, it defines how far the influence of a vector reaches. The discount function φ will represent the similarity between two vectors (a higher value means higher similarity); this similarity function can be used to compute the mass (influence) of each element in L given an arbitrary vector x using the Dempster rule of combination (Equation (7)), obtaining: where K is a normalization term that is defined by the DST restrictions as: In the above formulas, the values of the weight vector w and value of γ are parameters that are optimized during the training process using gradient descent.
Because of the uncertainty that exists when using DST, there is an additional value m * that must be considered, which is known as the domain mass or the uncertainty of the prediction, and it can be calculated using the following formula Unlike the previous masses that assign a value to a single result, this m * variable assigns a value to the entire range of the domain of y. Subsequently, this allows for us choose which value of the domain to use and, thus, obtain different predictions. For example, if we choose the mean value of the range, then we would have the prediction with greater likelihood. Alternatively, if we choose the lowest value of the interval, then we would have the lower bound of the prediction and, if we choose the highest value, then we would have the upper bound. Therefore, in addition to predicting a value, we can deliver a confidence interval of the prediction as a result.
The WEVREG method provides two types of interpretability, a global interpretability, which is feature importance obtained by inspecting the values of the weights w after they are optimized; and, a local interpretability because the predicted value of a single instance is explained by the attributes of its neighbors; the model can check and evaluate the masses of these neighbors and then identify the ones that are the most similar and their corresponding values.

Embedding
Besides the model chosen for predicting, we need to define the values of the input vectors; this process is known as embedding. Embedding is crucial for achieving good performance, especially in time series regression, where the data can be inputted in many different ways.
Given the importance of embedding, choosing one a priori can cause poor results in the prediction. Therefore, four different embedding methods will be tested in this work.
The first vector will consist of the dis-aggregated date features as year, as detailed in Table 2. It is assumed that, in some stores, the day of the week is highly correlated to the expected number of visitors or sales, e.g., a store that is located near offices is expected to receive high number of visitors during work days. However, the features that were used in Table 2 do not reflect the cyclic quality of time features. For example, the day of the week on the features varies from 0 (Monday) to 6 (Sunday). The furthest day from Sunday is Monday, whereas, in reality, Sunday and Monday are next to each other. Another feature vector will be tested, including these cycles. The new embedding has two features representing the vertical and horizontal components in each feature with a cyclic behaviour. The vector is detailed in Table 3. To build this circular representation, we can use the trigonometric functions sin and cos and thenassign an angle from 0 to 2π to the values for example 0 for Monday and 6 7 2π to Sunday. Note that this transformation effectively captures the cyclical state of these variables, because the distance function previously defined in the model (Equation (9)) is Euclidean and the decomposition into horizontal and vertical components makes contiguous elements have the same distance in the model.  Another variant of these vectors will be tested. For each feature vector presented, a sequence of previously observed values will be added, thus obtaining a total of four feature vectors. These new features will be a sequence of previous time steps of the variable commonly used in time series forecasting. Because the forecast is made with a month window, the values need to have a month of distance. For example, if the visitors for August 1st are being predicted, the previous values will consist of July 1st, June 1st, and so on. The number of months N for the feature vector will be set based on the proposed method performance. Table 4 details the new features.

Experiments and Results
The dataset consists of the number of recorded visitors, the number of sale operations, and the total sales of 20 retail stores from August 2017 to August 2019. Because this is a time series forecasting problem, validation methods, such as random split or cross-validation, cannot be used. The methods performance will be evaluated in a real world setting, with the prediction of a validation set that will consist of one month of data. Thus, data from August 2017 to July 2019 will be treated as a training set, and August 2019 will be the validation set. The Scikt-learn [36], Statsmodels [37], and Pytorch [38] libraries were used for the methods implementation. Scikit-Learn was used for GP, SVR and RF regressors; Statsmodels was used for its SARIMA implementation; Pytorch was used for its implementation of tensor operations and gradient descent. Pytorch was also used for the implementation of the LSTM neural network. Table 5 presents the specifications for the machine and software. The proposed method (WEVREG) will be compared against techniques previously used in store traffic forecasting. These are SARIMA, SVM-L1 (SVR), RF, GP, and LSTM neural network. Each method will have a single configuration for every store. The configuration for most of the methods was reached using a Grid Search approach, where the values of the parameters in each method are set based on the ones that yield the best performance (lower mean RMSE) while forecasting all stores. The hyper-parameters for the SARIMA method were found using the auto-correlation and partial auto-correlation coefficients.
The architecture used for LSTM was as recommended by Abrishami and Kumar [11], with a first layer of LSTM cells and a dense layer with an hyperbolic tangent as an activation function to obtain a single output.
Three measures will be used for assessing the prediction performance of each method. The first one is the RMSE (Equation (3)), another will be MAE (Equation (4)), and the last measure used will be the the Normalized Mean Absolute Error (NMAE), which has previously been used in traffic forecasting. The NMAE between two vectors is calculated as: where Y andŶ are vectors of size n. NMAE is the MAE normalized by the size of the interval of the real response, as shown in Equation (15). The performance of each method was calculated for each store using the various embeddings. Subsequently, the mean performance for all stores was computed. The results show the performance of WEVREG in every setting and only the best setting of the compared methods. Tables 6-8 show these results.   Figure 3 presents an example of the prediction of the number of visitors for a specific store obtained by the WEVREG method. This figure also shows the confidence interval that the proposed model is able to provide.  Figure 4 shows the attribute weights that were obtained after training the WEVREG model using the cyclic embedding. This result illustrates the interpretability that can be obtained from the model, since it is possible to know which attributes are the most important for the prediction and, thus, provide an additional value to the prediction. We also compare our model to previous ones regarding foot traffic, conversion rate, and sales prediction. Table 9 presents the methods to be compared. Observing Table 9, it is important to note that the various authors use different data to build their methods, so it is incorrect to perform a quantitative analysis on the models since their results are not comparable. However, we present a qualitative three-dimensional analysis (columns 4-6 in Table 9): the first dimension is the model accuracy, in which we distinguish three levels: high accuracy when the method has reported less than 15 % error, medium accuracy when the error is between 15% and 30%, and low accuracy if the error is greater than 30%. Interpretability is the second dimension: this column shows whether or not the authors present some type of interpretability of their results. Finally, the last dimension for comparison is the confidence interval (CI), which shows whether the method can produce CI in addition to its predictions.

Discussion
From the analysis of the results that are presented in the previous section, we can see that our approach manages to predict the key retail indicators correctly.
First, we can qualitatively analyze the behavior of the predictions by looking at Figure 3. We can see that, in general, the prediction is well adjusted to the real curve. The method has no problem detecting the peaks and valleys of the real values, although it does not reach the same upper and lower values. In particular, the model hardly reaches the extreme values of the prediction. This shortcoming can be explained by the nature of the prediction with KNN; note that, for predicting a 0 value (the minimum in our case), the model requires that all neighbors it observes must also have value 0; if any of them does not have a 0 value, then it "moves" the prediction towards the center.
Another feature of WEVREG is its ability of provide confidence intervals. In the same Figure, we can observe that almost all the real values are inside the confidence interval. However, we can note that the width of this interval is wide, covering about 30% of the prediction range. This could be because the vectors that are used by the methods after the KNN approach for each prediction are not very similar to one another, obtaining a high uncertainty for the process.
Another remarkable characteristic of the WEVREG model is its interpretability. The model gives the attribute weights after training, as shown in Figure 4. In this case, the model is using the cyclic embedding with the sequence of the six previous months for a particular store. From this figure, it is clear that, for this particular store, some components, such as the year or partial components of the hour and day of the month, are not really important for predicting foot traffic. Additionally, the most important features for predicting foot traffic seem to be the number of visitors observed in previous months. The previous month is the most important, and later observing a decrease in importance followed by an increase in the fifth and sixth month which could be because of the cyclic behavior of the visitors of this particular store.
Talking about performance, WEVREG achieves good performance overall, obtaining comparable results to the best methods that were tested in the literature. Table 6 shows the results for foot traffic predictions; the best model is RF achieving an RMSE of 0.0669, while the next best model is WEVREG using the circular embedding with RMSE of 0.0713. For the case of number of tickets predictions, Table 7 shows a similar behavior. The best two models are GP and RF with RMSE of 0.0712 and 0.0741, respectively. WEVREG using the sequential circular embedding is next reaching RMSE of 0.0795. Finally, Table 8 presents the results for sales predictions. In this case, all of the models, except ARIMA, achieve very similar performance results, the difference of RMSE between the best model and the second to last model is about 0.01; in our case, the best configuration is the normal embedding. Overall, RF obtains the best performance, and this result matches the one previously reported by Abrishami et al. [10]. WEVREG with sequential and circular embedding is the third best model for predicting foot traffic and conversion tickets. For the case of predicting foot traffic and tickets, we can see that WEVREG with circular and sequential embedding is also the third best model. In the case of sales prediction, our model seems to achieve worse performance than the others, but actually all models, except ARIMA, obtain very similar performance values.
Because these are time series problems, it was expected that LSTM, which is a deeplearning based method, would obtain the best results, but it could not outperform our proposed method in general. A possible explanation for this low performance could be the use of a single network architecture for all the stores and the size of the dataset being used. Abrishami and Kumar [11] obtain different results when compared to the presented one using a specific architecture in each tested store.
From the analysis of Table 9, we can see that our method is the only one that is capable of predicting the three indicators at the same time and having high accuracy, having interpretability, and providing confidence interval. We can observe that the rest of the methods may be classified into two groups: those that are derived from ARIMA, which present interpretability and the possibility of having CI, but having medium or low accuracy. The other group is composed by models that aare derived from machine learning methods; they achieve high performance, but do not have interpretability or CI.
In addition, different vector representations were tested in this work, and we split the results of WEVREG to these different embeddings to compare them. The circular embedding is the one yielding the best performance when considering all tested methods. It seems that the breakdown of cyclic features in horizontal and vertical components enables the methods to create a better comprehension of the variables as we anticipated when we introduced this embedding. In particular, for our method, the circular representation with the sequence of previous months is the one yielding best performance in most predicted variables. Another interesting result is that using circular embedding in other methods also implies better performance, even when these methods are not distance-based, which shows that this representation also eases partitioning the feature space for these methods.

Conclusions
In this work, we present a new interpretable regression model to predict key retail indicators. The results showed that our method achieves levels of precision that are similar to other methods reported in the literature for this same problem. In addition to the prediction methodology, the main contribution of this work is that the proposed model is transparent, which allows for us to discover and analyze the main features that influence the prediction. This allows store managers to obtain knowledge about which are the main variables that influence the behavior of customers, thus allowing them to react to sudden changes in the scenario, before the model adapts itself. Another important contribution that is not present in most previous works, is that it also gives a confidence interval for the predicted values. This method also produces prediction for the three key performance indicators at the same time. These are important implications for practitioners, since these advances are applicable to the stores operation.
As we mentioned earlier, this model can be used to support different processes in the retail business. For example, the foot traffic prediction for a store can be used to plan the number of staff members needed per day. Sales prediction allows for creating more specific sales goals or assigning risk values for meeting these goals. Additionally, these predictions can be used to evaluate the effectiveness of promotions, e.g., by comparing the predicted sales numbers with the actual ones of a promotion campaign.
Because the proposed model is based on KNN, one of its main limitations is that neighbors have to be similar to the scenarios to be predicted in order to generate forecasts with high levels of accuracy. Although a prediction can be made using less similar scenarios, these predictions tend to have high uncertainty. This explains the restriction of having at least four years of data imposed in our study, as explained in Section 3.1. This limitation makes the model unsuitable to operate with new stores that do not have a substantial data history. Another limitation occurs when the model predicts stores located in tourist places. This type of stores drastically change their behavior during vacation periods when compared to the rest of the year. Thus, during regular periods, they have low values and are almost always 0 for the indicators. When the holiday period begins, these indicators increase, sometimes even exceeding those of stores in large cities. We have observed that our model is not capable of predicting this type of behavior, even when the store complies with the constraint of having four years of data.
The proposed model can easily be adapted to any time series regression problem. As mentioned in the introduction, the base model was previously used for predicting health care costs, which is a different type of problem and field from the one that is presented in this work. The main activity that is required to adapt this method to other scenarios would be defining the most convenient embedding for the input variables, since this is one of the most important factors affecting the success of its application.
Another important aspect to mention is the behavior of the model in anomalous situations. A clear example is the COVID-19 pandemic, which caused lockdowns in major cities, forcing retail stores to close or at least to limit the number of customers as stated by Wang et al. [39]. A model, like the proposed one, and probably all models based on historical data, cannot predict this behavior that is generated by external causes. Instead, our model can predict the expected sales for the period of time the store was closed and, thus, these values can be used to estimate the losses that are generated during the period of closure.
Finally, as future work, the model may perhaps be used to create new tools to support the retail processes mentioned above. For instance, it may be possible to build a tool to evaluate the effectiveness of a promotion or another tool suggesting the optimal changes to make to the key indicators in order to achieve a sales goal that is based on the model predictions. Regarding the model itself, future research may study a possible improvement to make the model operate correctly with little data to make it suitable for new stores; a tentative strategy for this purpose may be to cluster similar stores in the training phase and use the ones with the longest history to fulfill the history of new stores. Related to the above, how the different embeddings affect the results when there is little data could also be tested. An embedding that, instead of creating many features, only generates a few ones may perhaps work better for stores with little data. Lastly, another line of future research is to try to improve the interpretability of the model even more. Currently the model elaborates an importance ranking for the features in the prediction; however, it does not report whether these features positively or negatively affect the output. An example of this hypothetical interpretability would be the method showing that sales increase between 14:00 and 18:00 hours, or that foot traffic decreases on Wednesdays.