Future-Aware Trend Alignment for Sales Predictions

Abstract: Accurately forecasting sales is a significant challenge faced by almost all companies. In particular, most products have short lifecycles without the accumulation of historical sales data. Existing methods either fail to capture context-specific, irregular trends or fail to integrate as much information as is available in the face of data scarcity. To address these challenges, we propose a new model, called F-TADA, i.e., future-aware TADA, which is derived from trend alignment with dual-attention multi-task recurrent neural networks (TADA). We use two real-world supply chain sales data sets to verify our algorithm's performance and effectiveness on both long and short lifecycles. The experimental results show that the accuracy of F-TADA is better than that of the original model, and its performance can be further improved by appropriately increasing the length of the windows in the decoding stage. Finally, we develop a sales data prediction and analysis decision-making system, which can offer intelligent sales guidance to enterprises.


Introduction
Accurate sales forecasting is crucial for supply chain management. Overestimation or underestimation can affect inventory, cash flow, business reputation, and profit. Hence, it has attracted attention from both the academic and industrial worlds. Essentially, sales prediction can be formulated as a time series forecasting problem, which is usually solved by the autoregressive model (AR) [1] or autoregressive moving average model (ARMA) [2,3]. The AR and ARMA models are suitable for stationary time series, but most time series data are non-stationary, so various linear and non-linear time series models [4] have come into being, namely the autoregressive integrated moving average (ARIMA) [5], seasonal ARIMA, the seasonally decomposed autoregressive (STL-ARIMA) algorithm [6], the autoregressive conditional heteroscedasticity model (ARCH), and generalized autoregressive conditional heteroskedasticity (GARCH). Moreover, autoregression methods that can model cointegration (autoregressive distributed lag (ARDL)) [7] or estimate covariance functions (stochastic autoregressive moving average (ARMA) processes) [8] have been proposed. The above regressive algorithms are not domain-specific, and they have been applied in various fields, such as sales forecasting [9] and wind speed prediction [10]. Some works even make the model and code publicly accessible; e.g., Taylor and Letham [11] put forward an additive model and made the model code open-source, which helps analysts solve a large number of time series prediction problems. Recently, the prediction of time series data (e.g., financial data, electronic health records, and traffic data) has also attracted widespread attention in machine learning, where time series are first transformed into hand-crafted features. For example, Yu et al. [12] used the support vector regression (SVR) algorithm to predict the sales of newspapers and magazines, and Kazem et al. proposed a stock prediction method based on SVR [13].
Gumus et al. predicted crude oil prices by using the XGBOOST algorithm with high accuracy and efficiency [14].

• We reproduce and improve the method in the TADA paper and propose a multi-attention mechanism and trend adjustment algorithm featuring the integration of future known features, which is called F-TADA.
• We analyze two typical real-world time series data sets. The experimental results show that the performance of the new method is better than that of the original algorithm.
• To apply the sales forecast algorithm to the intelligent decision-making process, we develop a sales data forecast and analysis decision-making system. This system has a grouping module and a sand table simulation module, which can provide better guidance for the enterprise sales decision-making process.
The rest of the paper is organized as follows. Section 2 presents a brief introduction to the two real-world data sets, including supermarket sales data and pesticide sales data. Section 3 describes the foundational model and shows how we derived our new model. Section 4 displays the experimental settings and provides a results analysis. Section 5 demonstrates our developed sales prediction system. Section 6 summarizes the findings and draws conclusions on the effectiveness of our model. Further, we outline the shortcomings of the model and list some possible future directions.

Data Introduction
Typical sales data can be divided into long period and short period data. We use supermarket chain sales as long-period time-series forecasting data and pesticide sales as short-period time-series forecasting data.

Description of Supermarket Sales Data
The supermarket sales data set comes from a Kaggle contest and consists of training data, store data, item data, transaction data, oil data, and holiday event data. The basic information of the data set is as follows:
1. Training data contain the date, store, and items sold.
2. Store data contain the store details, such as store location and store type.
3. Item data contain the characteristics of a commodity, such as perishability and the type of commodity.
4. Transaction data contain the number of sales per store in the training data.
5. Oil data and holiday event data contain the daily oil prices and holiday information.

Data Analysis
Figure 1 illustrates the total sales trends. It can be found that sales increase slowly over time, with a surge around Christmas at the end of each year, which indicates that the holiday factor is a very important element for sales forecasting.

As shown in Figure 2, we drew a heatmap for the sum of sales over the 12 months and seven days of the week for every year. The sales volume on weekends increased sharply compared with the weekday sales, which is in line with the consumption habits of human beings, as people are used to buying goods on weekends. Looking at the sales of each month longitudinally, the sales volume in December is greater than that during the remaining 11 months because December sales are generally boosted by Christmas shopping sprees.

Figure 2. Heatmap generated from the total sales by month and week. Note that the horizontal axis shows the 12 months, the vertical axis shows the period from Monday to Sunday, and the depth of color shows the amount of sales in that area.

Data Processing
For the supermarket sales data, we used Python and Pandas to preprocess the data to obtain the time series prediction data. All the data were first grouped into 54 stores according to the stores, and then each commodity in the 54 stores was also grouped. We used one-hot encoding to encode the characters and numbers. At the same time, we used the embedding dimension reduction method to reduce the dimensions of some features. The final data are composed of multiple files, and each file represents the time series sales data of a certain commodity in a certain store.
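The grouping and encoding steps described above can be sketched with Pandas. Note that the column names used here (`store`, `item`, `store_type`, `sales`) are illustrative assumptions, not the data set's actual schema:

```python
import pandas as pd

# Toy stand-in for the supermarket training data; real column names may differ.
df = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "item": ["A", "B", "A", "B"],
    "store_type": ["C", "C", "D", "D"],
    "sales": [10.0, 3.0, 7.0, 5.0],
})

# Group by store, then by commodity within each store, yielding one
# time series per (store, item) pair, as described in the text.
series = {key: g["sales"].to_list() for key, g in df.groupby(["store", "item"])}

# One-hot encode a categorical feature such as the store type.
encoded = pd.get_dummies(df, columns=["store_type"])
```

Each entry of `series` then corresponds to one output file: the sales time series of a certain commodity in a certain store.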

Description of Pesticide Sales Data
Data set 2 is the pesticide sales data provided by a pharmaceutical company. After cleaning and preprocessing the original data, the statistical data were obtained. The data were collected from the annual promotion activity table for 2017; the annual county-level consumption data and rice crops were combined to obtain the basic tables. We also used Python to crawl the data of the planting area on the internet. As the original data featured simple regularity, there were many missing values. Moreover, the data with time changes were found to be uncommon, so additional data on the weather and climate were used to help the training.
After data preprocessing and feature extraction using XGBoost [26], we obtained the final information data tables, including the date, downstream distributors, downstream distributor provinces, regions and classifications, regional brands, brands, volumes, artificial grassland, tea gardens, natural grassland, average surface temperature, average pressure, average temperature, average wind speed, sunshine time, and the number of activities in the month.

Data Analysis
The pesticide data are shown in Figure 3. Clearly, the seasonal trend is that pesticide sales decreased significantly at the beginning and end of 2017. Seasonal factors affected not only crop growth, but also sales volume.

Data Processing
For the supermarket sales data, we used Python and Pandas to preprocess the data to obtain the time series prediction data. All the data were first grouped into 54 stores according to the stores, and then each commodity in the 54 stores was also grouped. We used one-hot encoding to encode the characters and numbers. At the same time, we used the embedding dimension reduction method to reduce the dimensions of some features. The final data are composed of multiple files, and each file represents the time series sales data of a certain commodity in a certain store.

Description of Pesticide Sales Data
Data set 2 is the pesticide sales data provided by a pharmaceutical company. After cleaning and preprocessing the original data, the statistical data were obtained. The data were collected from the annual promotion activity table for 2017; the annual county-level consumption data and rice crops were combined to obtain the basic tables. We also used Python to crawl the data of the planting area on the internet. As the original data featured simple regularity, there were many missing values. Moreover, the data with time changes were found to be uncommon, so additional data on the weather and climate were used to help the training.
After data preprocessing and feature extraction using XGBoost [26], we obtained the final information data tables, including the date, downstream distributors, downstream distributor provinces, regions and classifications, regional brands, brands, volumes, artificial grassland, tea gardens, natural grassland, average surface temperature, average pressure, average temperature, average wind speed, sunshine time, and the number of activities in the month.

Data Analysis
The pesticide data are shown in Figure 3. Clearly, the seasonal trends of pesticides sales decreased significantly at the beginning and end of 2017. Seasonal factors affected not only crop growth, but also sales volume.

We sorted the selected features by importance to check whether the information crawled from the web was valid. An advantage of using the XGBoost [26] algorithm on these data is that, after the boosted trees are built, we obtain an importance score for each attribute, which allows the more important features to be extracted during preprocessing; we then discarded the less important features. The accuracy of the predictions with the data for Liaoning province was 0.56. Figure 4 shows the feature importance ranking determined by the XGBoost algorithm, led by the average temperature, average ground temperature, precipitation, and average air pressure. Here, the temperature, precipitation, and number of activities have the greatest impact, while the sand and marsh features are discarded.

Figure 4. The importance ranking of features. Note that the features are average temperature, average ground temperature, precipitation, average air pressure, average wind speed, activity time in the month, number of activities in the month, activities in the month, sunshine hours, cultivated land subtotal, orchard, woodland, other woodland, irrigated farmland, garden subtotal, cultivated land, cultivated paddy field land, woodland shrub land, grassland subtotal, evaporation, agricultural facility land, woodland, other grassland, ridges of fields, medication regulation, artificial grassland, natural grassland, tea gardens, sand, and swamps.
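The selection procedure itself — rank features by importance score, keep the top ones, discard the rest — can be sketched as follows. The scores shown are made-up placeholders, not the values behind Figure 4:

```python
# Hypothetical importance scores, standing in for XGBoost's per-feature scores.
importance = {
    "average temperature": 0.31,
    "precipitation": 0.22,
    "number of activities": 0.18,
    "average air pressure": 0.10,
    "sand": 0.001,
    "swamps": 0.0005,
}

# Rank features from most to least important.
ranked = sorted(importance, key=importance.get, reverse=True)

# Discard features whose score falls below a chosen threshold.
THRESHOLD = 0.01
kept = [f for f in ranked if importance[f] >= THRESHOLD]
```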

Preprocessing
Here, missing weather values are filled with the average of the values for the previous and following months, and missing land area values are filled with the average area of the province. Missing activity information is set to 0, while records with missing province or city information are discarded. The numerical and categorical features are processed in the same way as in data set 1.
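A minimal sketch of the imputation rules above, assuming monthly records stored in order; the function and variable names are illustrative:

```python
def fill_weather(values):
    """Fill a missing monthly weather value (None) with the mean of the
    previous and following months' values, as described in the text."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None and 0 < i < len(filled) - 1:
            filled[i] = (filled[i - 1] + filled[i + 1]) / 2
    return filled

def fill_activity(value):
    """Missing activity information is set to 0."""
    return 0 if value is None else value

temps = fill_weather([10.0, None, 14.0])
```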



Existing Deep Learning Methods
Traditional neural networks have no temporal connection between inputs, whereas recurrent neural networks (RNNs) have a natural advantage for data with temporal dependencies. Because of the vanishing gradient problem, the gated recurrent unit (GRU) was proposed; it changes the RNN's internal propagation structure, giving the model a long-term memory. Another common recurrent network that is more powerful and universal is the long short-term memory network (LSTM), which adds a forget gate and an output gate on top of the update gate. Cho et al. proposed a new neural network model called the RNN encoder-decoder model [19], which consists of two recurrent neural networks. Due to the limitations of the encoder-decoder, an attention mechanism was proposed: the attention model generates a vector sequence in the encoding stage, and each input in the decoding stage attends to a subset of that vector sequence, focusing on the part most closely related to the current input.
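To make the gating idea concrete, here is a single GRU step in plain Python with scalar state and hand-picked weights (both are illustrative assumptions; a real GRU uses learned weight matrices over vectors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, w):
    """One GRU update with scalar state h and scalar input x.
    w holds scalar weights; a real GRU learns matrices."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)                 # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)                 # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # gated interpolation

w = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wh": 0.9, "uh": 0.3}
h = 0.0
for x in [1.0, 0.5, -0.2]:   # run the cell over a tiny input sequence
    h = gru_step(h, x, w)
```

The update gate z interpolates between keeping the old state and adopting the candidate, which is what lets the unit carry information over long spans.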

Trend Alignment with Dual-Attention Multi-Task Recurrent Neural Networks
Trend alignment with dual-attention multi-task recurrent neural networks (TADA) is a sales forecast method published at ICDM 2018 by Chen et al. [20]. To a certain extent, this model improves prediction performance.
The goal of sales forecasting is to predict future sales from a variety of influencing factors and known sales values. The input is defined as the full feature set {x_t}_{t=1}^{T} = {x_1, x_2, ..., x_T} and the corresponding sales set {y_t}_{t=1}^{T} = {y_1, y_2, ..., y_T}. At time t, x_t ∈ R^n, where n is the feature dimension and T is the length of the historical window. The output of the sales forecast is the next ∆ sales values after the time T, which can be expressed as {ŷ_t}_{t=T+1}^{T+∆} = {ŷ_{T+1}, ŷ_{T+2}, ..., ŷ_{T+∆}}, where ∆ is determined by the target of the sales forecast. We assume that ∆ ≪ T and that {x_t}_{t=T+1}^{T+∆} are unknown in the forecast phase. Unlike the traditional autoregressive setting, the scalar values to be predicted have no accompanying features in the future. Therefore, we model sales forecasting as

{ŷ_t}_{t=T+1}^{T+∆} = F({x_t}_{t=1}^{T}, {y_t}_{t=1}^{T}),    (1)

where {x_t}_{t=1}^{T} are the features from the first timestamp to time T; {y_t}_{t=1}^{T} is the historical sales information; {ŷ_t}_{t=T+1}^{T+∆} are the values to be predicted; and F(·) is the nonlinear mapping to be learned.
First, the basic model encodes with the LSTM, where the activation function σ is the logistic sigmoid function, and ω, b are the weights to be updated. The features are split into internal and external features. Internal features are attribute information directly related to the product, such as the location of the store and the category of the goods. External features refer to external attribute information, such as the weather conditions at the time, or whether it is a special holiday. A single LSTM encoder may therefore lose contextual information by mapping all of the original features into a unified space, so two parallel LSTMs are used to effectively capture the different influencing modes of the features by modeling the internal and external features as two subtasks. Accordingly, the problem in Equation (1) is extended, and the structure of the encoding phase of the TADA model is shown in Figure 5. The two encoders take the internal feature input {x_t^int}_{t=1}^{T} and the external feature input {x_t^ext}_{t=1}^{T}, respectively, and their hidden states h_t^int and h_t^ext are combined as shown in Equation (3). The internal-feature LSTM, the external-feature LSTM, and the combined LSTM encoder do not share weights and offsets; each learns its own. After encoding all the historical information of the sales time series, we finally obtain the context vector h^con. To predict the next sales sequence {ŷ_t}_{t=T+1}^{T+∆}, we use the decoder to simulate the context vectors for the next ∆ steps, with T < t ≤ T + ∆:

d_t^con = LSTM_dec(x_t^dec, d_{t-1}^con),

where x_t^dec is the input of the decoder and LSTM_dec(·) is the decoder LSTM.
Here, x_t^dec is the attention-weighted input of the decoder, and d_{t-1}^con is the hidden-state output of the decoder at the previous step.
As can be seen from Equation (1), the inputs include both internal and external features, which are unknowable after time T. Therefore, to obtain the input of the decoder, we use the attention mechanism, as shown in Equation (6), where α_{tt'}^int and α_{tt'}^ext denote the attention weights over the hidden states of the internal-feature encoder and the external-feature encoder at time t'. x_t^dec refers to the input value at an unknown time, obtained by weighting the outputs of the internal and external hidden layers by their attention proportions.
This effect is quantified by the correlation between d_{t-1}^con and both h_{t'}^int and h_{t'}^ext: to obtain the input of the decoder at time t, e_{tt'}^int and e_{tt'}^ext are the correlation scores between d_{t-1}^con and the hidden-layer outputs of the internal-feature LSTM and the external-feature LSTM at time t', respectively. Here, v^int, v^ext, M^int, H^int, and H^ext are parameters for the model to learn. Intuitively, the degree of correlation between two vectors is measured by projecting them into a common space. The softmax function is then applied to normalize the two sets of correlation scores into attention weights, so that after applying Equation (6), the decoder input carries both the information of time t and the previous context. As the length of the prediction stage increases, the performance of the encoder-decoder network decreases significantly. To alleviate this problem, a traditional attention mechanism aligns the current output with the target input by comparing the current hidden state with the hidden states generated at previous time steps. However, such methods cannot predict the approximate trend of the next ∆ steps, so a new trend-adjusted attention mechanism is proposed. The assumption is that subsequent trend changes will be similar to prior trends, so it is feasible to align with the most similar historical trend. As shown in Equation (9), p_i represents the joint vector of hidden states:

p_i = [h_i^con; h_{i+1}^con; ...; h_{i+∆-1}^con],  1 ≤ i ≤ T − ∆ + 1.    (9)
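The attention step described above — score each encoder hidden state against the previous decoder state, normalize with softmax, and take the weighted sum — can be sketched as follows. The dot-product score is a simplified stand-in for TADA's learned correlation score:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score each encoder hidden state against the decoder state with a
    dot product (stand-in for the learned score), then return the
    attention-weighted sum of the encoder states and the weights."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * hs[k] for a, hs in zip(alphas, encoder_states))
               for k in range(dim)]
    return context, alphas

context, alphas = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In TADA this weighting is done twice, once over the internal-feature encoder states and once over the external-feature encoder states, and the two contexts form the decoder input.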
It can be seen intuitively that ∆ is the time window, and the original T steps are divided into segments of length ∆. The trend to be predicted can likewise be expressed as p = [d_{T+1}^con; d_{T+2}^con; ...; d_{T+∆}^con]. Each p_i acts as a sliding window, and the model finds the window most similar to p among all p_i. A similarity score e_i^trd between p and each p_i is computed, and the closest sliding window is selected by

i* = argmax {e_1^trd, e_2^trd, ..., e_{T−∆+1}^trd}.
We assume that the most similar window is p_{i*} = [h_{i*}^con; h_{i*+1}^con; ...; h_{i*+∆-1}^con] and combine the windows according to Equation (13), using the result to obtain the final predicted value, as shown in Equation (14), where ŷ_t ∈ {ŷ_t}_{t=T+1}^{T+∆} represents the prediction at time t, while v_y^T and b_y are the parameters for us to learn.
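The trend-alignment step reduces to a sliding-window search; this sketch uses a plain dot product as a stand-in for the learned similarity score e^trd, over scalar "hidden states" for readability:

```python
def most_similar_window(history, trend):
    """Slide a window of len(trend) over history, score each window with a
    dot product (stand-in for the learned score e^trd), and return the
    start index i* and contents of the best-matching window."""
    delta = len(trend)
    best_i, best_score = 0, float("-inf")
    for i in range(len(history) - delta + 1):
        window = history[i:i + delta]
        score = sum(a * b for a, b in zip(window, trend))
        if score > best_score:
            best_i, best_score = i, score
    return best_i, history[best_i:best_i + delta]

# Predicted trend [0.8, 1.0] best matches the [0.9, 1.0] segment of history.
i_star, window = most_similar_window([0.1, 0.9, 1.0, 0.2, 0.1], [0.8, 1.0])
```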
For model training, we use the mean square error as the loss function, with L2 regularization to prevent the model from overfitting:

L = (1/N) Σ_i (ŷ_i − y_i)^2 + λ‖ω‖_2^2.
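The objective can be written as a short function; λ corresponds to the regularization coefficient:

```python
def loss(y_true, y_pred, weights, lam=0.001):
    """Mean squared error plus an L2 penalty on the model weights."""
    n = len(y_true)
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n
    l2 = lam * sum(w ** 2 for w in weights)
    return mse + l2
```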

A Deep Learning Model that Incorporates Future Known Features
This section focuses on improving the trend adjustment algorithm of the multi-attention mechanism. For the TADA model, the definition of the sales forecasting task shows that the predicted scalar values are featureless in the future.
The algorithm assumes that {x_t}_{t=T+1}^{T+∆} are unknown factors in the prediction stage, and this assumption can be relaxed. For example, for the pesticide data, temperature, precipitation, and other factors can be obtained in advance through weather forecasts, and these predicted characteristics are likely to be accurate; if we discard this information, the prediction accuracy may be directly reduced. For data set 1, the city of the store is an internal characteristic, which will not change and has no effect on the prediction. However, dates, holidays, and similar factors are known with certainty in the prediction stage. They are important for predicting sales, yet the original model does not consider them. For data set 2, if the forecast average temperature and related information are known, the prediction accuracy will improve. At the same time, we add information from the company's internal management system, such as the next month's activity planning. Therefore, it is very helpful to integrate this portion of the information into the sales forecasting task.
Based on the above concepts, the original TADA model is improved. First, we redefine the sales forecast problem. The input of the improved sales forecast model is defined as the full feature set {x_t}_{t=1}^{T} = {x_1, x_2, ..., x_T} and the corresponding sales set {y_t}_{t=1}^{T} = {y_1, y_2, ..., y_T}. In addition, there is some known information between T + 1 and T + ∆, represented by {z_t}_{t=T+1}^{T+∆}. At time t, x_t ∈ R^n and z_t ∈ R^m, where n and m are the feature dimensions. The output of the sales forecast is the ∆ sales values after T, so the definition of the sales forecast becomes Equation (16):

{ŷ_t}_{t=T+1}^{T+∆} = F({x_t}_{t=1}^{T}, {y_t}_{t=1}^{T}, {z_t}_{t=T+1}^{T+∆}),    (16)

where {x_t}_{t=1}^{T} are the features from time 1 to T, with feature dimension n; {y_t}_{t=1}^{T} is the historical sales information; and {z_t}_{t=T+1}^{T+∆} are the features known with high probability between T + 1 and T + ∆, with feature dimension m. We then predict the values {ŷ_t}_{t=T+1}^{T+∆}.
Due to the added information, some changes are made to the TADA model to adapt it to the new definition of the sales forecasting problem. As in the previous section, the basic model is an encoder-decoder, and some adjustments are made to its structure. First, the input features of the encoder stage are divided into internal feature input and external feature input; after the two LSTM passes, the hidden-layer outputs are obtained and combined with the historical sales volume into the context vector that constitutes the encoding phase of the model. Since {z_t}_{t=T+1}^{T+∆} cannot be used in the encoding stage, we retain the encoder stage of the original model unchanged. In the decoder stage, the original model uses the multi-attention mechanism to obtain its input, so the known future information can be added to the decoder for training. As before, x_t^dec is the input of the LSTM model in the decoder stage. Since part of the information is unknown and part is known during the prediction period, a new method is used that combines the features produced by the attention mechanism with the known features: the improved TADA model adds {z_t}_{t=T+1}^{T+∆} to the decoding stage, splicing it with the result of the attention mechanism to obtain the decoder input x_t^dec (see the decoding-stage input of the TADA model above for the detailed formulas). A schematic diagram is shown in Figure 6. The known features added in the decoder stage include category features reduced by embedding, where the embedding weight vector is consistent with the embedding weights of the input in the encoding stage.
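The key change in the F-TADA decoder input reduces to a simple per-step vector concatenation of the attention context with the known future features z_t; the vectors in this sketch are illustrative:

```python
def decoder_input(attention_context, z_t):
    """F-TADA decoder input: the attention-weighted context spliced with
    the features known for the future time step (dates, holiday flags,
    weather forecasts, planned activities)."""
    return list(attention_context) + list(z_t)

# Hypothetical 2-dim attention context and 3-dim known-feature vector z_t.
x_dec = decoder_input([0.4, 0.6], [1.0, 0.0, 1.0])
```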
There are two innovations in the improved algorithm: one is the redefined sales forecasting problem, and the other is the improved decoder.

Experiments and Results

1. Data set partitioning: For data set 1, we use a total of 365 data points from 2016 to 2017 and divide the data at a ratio of 15:2:2. For data set 2, we use the annual data of each province and city in 2017 and divide them at a ratio of 8:1:1.

2. Evaluation metrics: We use the mean absolute error (MAE) and the symmetric mean absolute percentage error (MAPE, or SMAPE).

3. Optimization: We use mini-batch gradient descent with the Adam optimizer.
The MAE and MAPE are defined in Equations (20) and (21):

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|,    (20)

MAPE = (100%/N) Σ_{i=1}^{N} |ŷ_i − y_i| / ((|y_i| + |ŷ_i|)/2).    (21)
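Both metrics can be implemented directly; the MAPE form below uses the symmetric denominator (|y| + |ŷ|)/2, matching the "symmetrical" variant named above:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    return 100.0 / len(y_true) * sum(
        abs(t - p) / ((abs(t) + abs(p)) / 2) for t, p in zip(y_true, y_pred)
    )
```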

Experimental Results and Analysis
For data set 1, ∆ is set to 2, 4, and 8 for the experiments. The algorithm needs to divide the data features into internal features and external features to obtain more accurate experimental results. The internal characteristics of data set 1 include the store city, store state, store type, store group, commodity family, and commodity category. The external features include time, total sales volume, whether it is a local holiday, whether it is a national holiday, whether it is a weekend, whether the commodity is perishable, and the price of crude oil. In the decoder stage, the F-TADA algorithm adds the date, holiday information, and weekend information as features. The 365-dimensional one-hot vector with the date information is embedded into 5 dimensions, and the weight matrix for this dimension reduction is the embedding matrix trained in the encoding stage; this solves the problem of insufficient training for date encoding in the decoder stage. Holiday information is a Boolean variable. Therefore, the z_t input of the F-TADA algorithm is 8-dimensional in the decoding phase. We also tune the hyperparameters during the experiments, including the number of hidden-layer features and the coefficient of the regularization term. Finally, we select 128 as the number of hidden-layer features and 0.001 as the regularization coefficient, which are the best parameters for both the TADA and F-TADA algorithms. To reproduce the experimental results of the original TADA paper, the data set partition, data preprocessing, and best parameters are kept consistent with those in the earlier paper. However, due to the stochasticity of deep learning training, there is a slight deviation between the original results and the results in this paper. In addition to the original TADA model, we add the encoder-decoder model and the attention mechanism model for the comparative experiment.
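The date-embedding trick amounts to an embedding lookup: multiplying a 365-dimensional one-hot vector by a 365×5 matrix simply selects one row. The random matrix here is a stand-in for the matrix trained in the encoding stage:

```python
import random

random.seed(0)
EMB_DIM, NUM_DAYS = 5, 365

# Stand-in embedding matrix; in F-TADA this is the trained encoder embedding.
embedding = [[random.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
             for _ in range(NUM_DAYS)]

def embed_date(day_of_year):
    """Equivalent to multiplying the 365-dim one-hot date vector by the
    embedding matrix: a single row lookup."""
    return embedding[day_of_year]

vec = embed_date(42)
```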
The experimental results are shown in Table 1. As can be seen from the results in Table 2, when ∆ = 2, the accuracy of the F-TADA algorithm is similar to that of the original algorithm: the MAE value of TADA is slightly better, while the MAPE value of F-TADA is slightly better, so the two algorithms perform almost the same. However, as ∆ increases, the advantages of the F-TADA algorithm gradually emerge, and its results are superior. Therefore, based on the results for data set 1, the improved algorithm is effective in sales series forecasting, and with an increase in ∆ it effectively improves the prediction accuracy.
For the prediction results of data set 1, we select the ∆ = 4 data to visualize and draw a graph of the observations, as shown in Figure 7. After visualizing the sales values over 65 days, the trend learned by the model was observed to be stable compared to the real values.
Data set 2 uses the encoder-decoder model with the attention mechanism, the TADA algorithm, and the TADA algorithm fusing future known features (F-TADA) for training. The TADA and F-TADA algorithms divide the features into internal and external features in the encoder stage. The internal characteristics of data set 2 are as follows: downstream dealer province, downstream dealer city, region, brand, brand classification, and various crawled land area data. The external characteristics of data set 2 include the time, date, average surface temperature, average air pressure, average temperature, average wind speed, precipitation, sunshine hours, activity duration in the month, and number of activities in the month.
For the pesticide sales data in data set 2, seasonal factors are particularly important. The sales volume of pesticides is zero in most months but becomes prominent during others. For example, pesticides and herbicides are generally sold in the summer in northern cities, while winter sales are 0. Because most of the sales values are 0, the total annual sales of some kinds of drugs are 0 in certain cities. Therefore, in the experimental stage, we counted the number of months out of 12 with zero sales and used a flag to express this situation. As can be seen from Figure 8, there were 4849 city-drug sales combinations. Among them, only 223 had sales throughout the year, while the data with zero sales in three months or fewer account for about one fifth of the total. Moreover, the data with a sales volume of 0 over 7, 8, 9, and 10 months account for a large proportion.
The prediction results for more stationary time series are better than those for data with more zero values. Therefore, we selected the subsets with flag ≤ 3, flag ≤ 5, and flag ≤ 10 as three data sets for the experiment, with 20 city-drug combinations as the test set in each. Data set 2 is characterized by short-period time series prediction: only one year of sales records is available, counted by month. Since most of the sales values in the records are 0, sales forecasting for data set 2 is a major challenge. When a certain drug in a city is predicted independently, there are only 12 low-quality data points, so an independent autoregressive prediction is impossible. For this kind of short-period data, the time span is insufficient, but there are many city-drug combinations. In the data set 2 prediction stage, the activity plan for the next month is uploaded in the company management system, and the weather forecast information can be obtained, as well as whether the month falls within a rice-growth period. Therefore, the F-TADA algorithm is the best solution for this type of sales data.
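The zero-month flag and the resulting subsets can be sketched as follows. The data here are made up for illustration (the city and drug names are hypothetical); the logic is simply counting the zero-sales months in each 12-month series and filtering by a threshold.

```python
# Hypothetical monthly sales per (city, drug) combination; 12 values each.
sales = {
    ("CityA", "herbicide"): [0, 0, 0, 5, 9, 14, 11, 6, 0, 0, 0, 0],
    ("CityB", "pesticide"): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ("CityC", "herbicide"): [3, 2, 4, 5, 6, 7, 5, 4, 3, 2, 2, 1],
}

def zero_month_flag(monthly):
    """flag = number of months (out of 12) with zero sales."""
    return sum(1 for v in monthly if v == 0)

flags = {k: zero_month_flag(v) for k, v in sales.items()}

# Keep only the series under a zero-month threshold, e.g. flag <= 3,
# as done for the flag <= 3, flag <= 5, and flag <= 10 subsets.
subset = {k: v for k, v in sales.items() if flags[k] <= 3}
```

A series sold throughout the year has flag = 0, and a series never sold has flag = 12; lowering the threshold keeps only the more stationary series, which the experiments show are easier to predict.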
By embedding the one-hot vectors of the cities, medicines, and types, we can better learn the internal connections in the data and improve the accuracy of the predictions.
In data set 2, the training process and decoder stage include the following information in {z_t}, t = T+1, …, T+∆: the date, the number of activities in the month, the medication rule, precipitation, average surface temperature, average temperature, average air pressure, sunshine hours, evaporation, and average wind speed. The 12-dimensional one-hot date vector is embedded into a 2-dimensional vector. The medication rule is a Boolean variable, while the remaining features are numeric. Therefore, the z_t input of the F-TADA algorithm has 11 dimensions in the decoding stage. The experiment was conducted with ∆ = 3, and the hyperparameters, including the number of hidden-layer features and the coefficient of the regularization term, were tuned during the experiment. As with data set 1, we selected 128 as the number of hidden-layer features and 0.001 as the regularization coefficient, the best parameters for both the TADA and F-TADA algorithms. The final experimental results are shown in Table 3. As can be seen from the results for data set 2, due to the lower quality of this data set (its short period, many zero values, etc.), the prediction results are worse than those of data set 1. We can also see that the F-TADA algorithm improves upon TADA; however, as the number of zero values increases, the prediction accuracy decreases. Comparing F-TADA against the single attention mechanism and the encoder-decoder model shows that the multi-attention mechanism and trend-adjustment algorithm combined with future known features is superior.
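The 11-dimensional z_t for data set 2 can be assembled analogously to data set 1. Again, this is an illustrative sketch with hypothetical names and random initialization: 2 (month embedding) + 1 (Boolean medication rule) + 8 numeric weather/activity features = 11 dimensions, matching the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical month-embedding matrix: 12-dim one-hot -> 2 dims.
MONTH_EMB = rng.normal(scale=0.1, size=(12, 2))

def zt_dataset2(month, medication_rule, numeric):
    """z_t for data set 2: 2 (month embedding) + 1 (Boolean medication
    rule) + 8 numeric features (activities, precipitation, temperatures,
    air pressure, sunshine, evaporation, wind speed) = 11 dimensions."""
    assert len(numeric) == 8
    return np.concatenate([MONTH_EMB[month - 1],
                           [float(medication_rule)],
                           numeric])

z = zt_dataset2(month=7, medication_rule=True,
                numeric=np.array([2, 85.0, 24.1, 22.5, 1002.0, 7.3, 4.8, 2.1]))
assert z.shape == (11,)
```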
The MAE value in the experimental results decreases as the number of zero values increases. The reason for this phenomenon is that with more zero values, the predicted values stay close to 0, which lowers the MAE, whereas the MAPE value can be compared across data sets. Thus, as the number of zero values increases, the MAPE value becomes increasingly worse; in this case, the MAE is not a reliable measure.
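This metric behavior is easy to demonstrate with toy numbers (the series below are invented for illustration): on a mostly-zero series, predictions that hug zero give a small MAE even though the relative error is huge, while MAPE exposes the problem.

```python
import numpy as np

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def mape(y, yhat, eps=1e-8):
    # Percentage error; MAPE is undefined at true zeros, so a tiny eps is
    # added here purely to make the illustration computable.
    return float(np.mean(np.abs((y - yhat) / (y + eps)))) * 100

dense = np.array([10.0, 12.0, 9.0, 11.0])     # regular sales series
dense_pred = dense + 1.0                      # off by 1 everywhere
sparse = np.array([0.0, 0.0, 0.0, 1.0])       # mostly zero sales
sparse_pred = np.array([0.1, 0.1, 0.1, 0.5])  # predictions hug zero

# MAE looks "better" on the sparse series only because the values are tiny,
# while MAPE reveals that the relative error is far worse.
assert mae(sparse, sparse_pred) < mae(dense, dense_pred)
assert mape(sparse, sparse_pred) > mape(dense, dense_pred)
```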
Meanwhile, after we obtained the optimal hyperparameters according to the validation data set, we observed how the three indicators changed with the number of epochs during the training and testing processes (see Figure 9).
For the prediction results of data set 2, we select data with flag ≤ 3 to visualize and draw a graph for observation, as shown in Figure 10. The model is found to learn the sales volume over the next three months, and the trend is correct.

Summary
This section introduced the existing deep learning methods and F-TADA, which adds future known characteristics to trend alignment with dual-attention multi-task recurrent neural networks. We combine the known features of the prediction stage, improve the decoder stage, and verify the model on both data set 1 and data set 2. The experimental results show that the MAE and MAPE values of the multi-attention mechanism and trend-adjustment algorithm are lower than those of the original encoder-decoder-based model. According to the results of data set 1, as the length of the decoding stage increases, the improved algorithm offers a growing improvement over the original algorithm.

Demand Analysis
In the production and sales processes of many small and medium-sized retail enterprises, especially those lacking experience, unnecessary losses are inevitable without the assistance of advanced intelligent algorithms. To apply time series algorithms to these real-world scenarios and combine business practices with artificial intelligence, we developed a sales forecasting and analysis decision-making system.
For most small and medium-sized enterprises, the cost of hiring a professional data analyst or algorithm engineer team is too high, so the research and development of sales forecasting and intelligent decision-making platforms is very necessary. This sales forecasting platform can play a guiding role in a company's business planning, help develop reasonable strategies for inventory planning, reduce a company's unsalable risks, and maximize the sales benefits of popular products. Companies can obtain the prediction results and decision-making plans by providing data to the platform. This decision-making scheme can be used as the final solution for inventory management.
Python is used to implement the common machine learning and deep learning algorithms and to provide data interfaces for them. Finally, a website-based system was delivered to the enterprise. Because the system is an online application, users do not need to install software locally; they can simply access the website from a computer or mobile phone.
The system's functional requirements include seven modules: the user login and logout module, the data import module, the data grouping module, the data visualization module, the data prediction module, the correlation analysis module, and the sand table simulation module.

1.
System architecture design: We adopt a system architecture that separates the frontend and backend.

2.
Database design: MongoDB is used as the basic database.

3.
Frontend design: The frontend uses a page structure and is written in AngularJS.

4.
Backend design: We use the Flask framework with a Celery distributed task queue.
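The backend's Flask-with-Celery design boils down to an asynchronous job pattern: the frontend submits a prediction job, immediately receives a task id, and polls for the status later (the "bell" notifications described below). The sketch uses only the Python standard library to illustrate that pattern; the function names, the dummy "model", and the in-memory task registry are assumptions, not the system's actual code.

```python
# Stdlib sketch of the asynchronous task pattern behind Flask + Celery:
# submit a job, get a task id at once, poll for completion later.
from concurrent.futures import ThreadPoolExecutor
import uuid

_pool = ThreadPoolExecutor(max_workers=4)   # stands in for Celery workers
_tasks = {}                                 # stands in for the result backend

def submit_prediction(series):
    """Queue a (dummy) forecasting job and return its task id immediately."""
    task_id = str(uuid.uuid4())
    # The "model" here is just the series mean, purely for illustration.
    _tasks[task_id] = _pool.submit(lambda: sum(series) / len(series))
    return task_id

def poll(task_id):
    """Return ('success', result) if finished, else ('pending', None)."""
    fut = _tasks[task_id]
    return ("success", fut.result()) if fut.done() else ("pending", None)

tid = submit_prediction([3, 4, 5])
_tasks[tid].result()   # in this demo, wait for the worker before polling
```

In the real system, a message broker and result backend replace the in-process dictionary, which is what lets multiple machines and processes run predictions concurrently.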

Function Demonstration
The whole system has a home page that provides a general introduction to the overall functions, as shown in Figure A1 (See Appendix A).
After logging in, users of the system enter the home page, as shown in Figure A2 (See Appendix A). In the middle part of the page, the historical forecast results and data forecast progress are presented.
Users can select and edit templates, and the Excel data template generation page is shown in Figure A3 (See Appendix A). The data import, data edit, and data group modules are shown in Figure A4 (See Appendix A). The data visualization module is shown in Figure A5 (See Appendix A). The data prediction module is shown in Figure A6 (See Appendix A).

Conclusions
In the sales prediction task, we often meet three tough challenges: (1) trend or pattern irregularity in sales time series data; (2) complex context-specific relationships among influential factors, including real sales numbers, internal factors (e.g., brand, category, price, etc.), and external factors (e.g., weather, holiday, promotion, etc.); and (3) data scarcity. The state-of-the-art method, TADA, tries to solve the first two challenges. To solve all three challenges, our work differentiates itself by integrating all the available information (both past and future) based on TADA. More specifically, we propose a future-aware model called F-TADA. In the decoder stage of the model, the new features are spliced with the decoder input, and the output of the attention mechanism serves as the final input of the decoding stage. After the trend matching of the TADA model, we obtain the final prediction results.
The experimental results show that the deep learning algorithm integrating known features in the prediction stage improves upon the accuracy of the original model and offers a better prediction of sales data by using more of the known information. Moreover, based on the experimental results for the supermarket chain, as the length of the decoding stage increases, the F-TADA algorithm improves increasingly over the original algorithm. Based on the pesticide sales results, the F-TADA algorithm can also handle similar short-period sales prediction tasks.
Meanwhile, existing works mainly pay attention to the algorithms without illustrating a real sales prediction management system. In our work, to help enterprises forecast sales data, we developed a sales data prediction and analysis decision-making system. The system features an asynchronous architecture with a separation of the frontend and backend, a fast response speed, and concurrent prediction of data using multiple machines and processes. The system is divided into four modules: the visualization module, the feature analysis module, the prediction module, and the sand table exercise module. The sand table simulation module saves the model during training and processes the original data with hypothetical interventions, such as promoting a product or increasing the number of product sales activities. It is the intelligent decision-making element of the system, obtaining the impact of such interventions on the final prediction results by calling the model. The sand table exercise module is another innovation of this paper and allows the sales forecast to guide sales in a real sense. In sum, we develop a flexible and user-friendly sales prediction management system and demonstrate it in this paper so that it can offer practical insights for both the academic and industrial worlds.
Finally, for the algorithms, some concerns remain unaddressed. For instance, we did not consider seasonal effects in the model. Moreover, we only tested sales data prediction tasks with clear future information; more relevant data sets should be tested in the future. Furthermore, to handle data with missing labels, predictive contrastive coding could be utilized in the sales prediction task.
Figure A1. Home page of the sales data prediction and analysis decision system. Figure A2. Main page of the sales data prediction and analysis decision system. If we click the "bell" sign, it will show the status (success or failure) of any sales prediction task. Figure A3. Data template generation page in Excel. To adapt to different preprocessing methods (e.g., one-hot encoding and normalization), we first determined the features template.
Information 2020, 11, x FOR PEER REVIEW
Figure A4. Loading the real sales prediction data set (csv or Excel format). View of the data import, data edit, and data group modules. Figure A5. View of the Data Visualization Module. Here, we can show the data distribution and characteristics of the data by specifying the available time periods. Figure A6. View of the Data Prediction Module. On the left, the "Predict Runner" allows us to choose some hyperparameters to set up the prediction model, such as the periods of the forecast, the prediction model, and the validation method. On the right, the software can visualize the prediction results.