1. Introduction
With the development of the Internet and the arrival of the post-epidemic era, more and more businesses in the traditional catering industry have adopted the Online-to-Offline (O2O) business mode, which specifically refers to online order placing and offline delivery. On-demand Food Delivery (OFD) platforms have seized this opportunity to grow, such as Meituan and Ele.me in China, Uber Eats and DoorDash in the United States, and Just Eat and Deliveroo in Europe. As the delivery field expands, it is no longer limited to food, but also covers medicine, flowers, and other items. The scale of the Internet delivery industry is growing steadily. According to data released by the China Internet Network Information Center (CNNIC), as of December 2024, the number of online takeout users in China reached 592 million, accounting for 53.4% of all Internet users [
1]. The development of China’s takeout industry is among the most advanced in the world [
2]. According to Meituan’s 2024 financial report, as one of the largest O2O e-commerce platforms for local services in China, the company achieved an annual revenue of CNY 337.6 billion (Renminbi), the number of annual transaction users exceeded 770 million, and the number of annual active merchants increased to 14.5 million [
3].
In the O2O business mode of a takeout system, four entities are involved: the OFD platform, merchant, rider, and user. The overall structure is shown in
Figure 1, which has a triangular pyramid shape. The dotted lines show online communication, the solid lines represent offline interaction, and the arrows indicate the direction of communication or interaction.
The whole process of order fulfillment consists of both online and offline components. In online fulfillment, the OFD platform typically acts as an intermediary, connecting the other three roles. Specifically, the user creates a takeout food order through the OFD platform, while the merchant and the rider receive the order information and confirm acceptance through the same platform. In offline fulfillment, the merchant prepares the food after the order is confirmed, while the rider picks up the order and delivers it to the user once ready. Under this mechanism, the rider is responsible for the offline transfer work, which plays a crucial role in the overall process of the takeout service. Therefore, the Order Fulfillment Cycle Time (OFCT, which refers to the time taken for an order from creation to delivery) is closely related to the time taken for activities such as the rider accepting the order, arriving at the restaurant, picking up the food, and delivering it. These activities correspond to steps S0, S1, S2, and S3 in
Figure 1. Specifically, step S0 indicates that the rider is considering accepting the order after receiving the notification that the order has been created. Step S1 indicates that the rider is heading to the merchant’s restaurant after accepting the order. Step S2 indicates that the rider is waiting to pick up the food after arriving at the restaurant. Step S3 indicates that the rider is on the way to the user’s address after picking up the food.
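To make the decomposition concrete, the following is a minimal sketch (with hypothetical event names and timestamps, not the platform's actual schema) of computing OFCT as the sum of the S0–S3 durations:

```python
# Sketch: decomposing OFCT into the four rider-side steps S0-S3.
# Event names and timestamps are illustrative assumptions.
def ofct_seconds(ts):
    """ts maps event names to Unix timestamps (seconds)."""
    s0 = ts["accepted"] - ts["created"]               # S0: rider decides to accept
    s1 = ts["arrived_at_merchant"] - ts["accepted"]   # S1: travel to the merchant
    s2 = ts["picked_up"] - ts["arrived_at_merchant"]  # S2: wait for the food
    s3 = ts["delivered"] - ts["picked_up"]            # S3: travel to the user
    assert s0 >= 0 and s1 >= 0 and s2 >= 0 and s3 >= 0
    return s0 + s1 + s2 + s3

events = {"created": 0, "accepted": 90, "arrived_at_merchant": 510,
          "picked_up": 900, "delivered": 1800}
print(ofct_seconds(events))  # total OFCT: 1800 seconds
```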
A common challenge faced by e-commerce industries in the O2O mode is short delivery time, due to the fact that users want fast service and are sensitive to delays [
4]. On the one hand, the time sensitivity of users is different for services in different industries. On the other hand, even within the same industry, users’ time sensitivity is heterogeneous [
5]. Time sensitivity, i.e., the degree to which a user is sensitive to the time it takes to receive a purchased product, can be high or low, and users are also heterogeneously distributed between patience and impatience [
6,
7]. For example, order fulfillment in the express logistics industry is on a daily scale, meaning it typically displays the expected delivery date on the platform [
8,
9]. Order fulfillment in the housekeeping industry is on an hourly scale, meaning it usually has options for several hours of to-home service, and there are also hourly billing rules [
10]. Order fulfillment in the takeout industry is on a minute scale, meaning it is displayed in minutes on the OFD platform’s merchant card and detail page for OFCT [
11]. This is related to the particularity of the OFD platform, which requires fast delivery services to ensure the freshness of takeout food [
5]. Among these industries, takeout has the highest time sensitivity, and the heterogeneous sensitivities of users lead to different tolerances for delays. Therefore, the estimated OFCT displayed on the OFD platform directly influences users’ choices of and preferences for merchants when they are ready to place orders [
12]. Moreover, once orders are placed, whether the food is delivered as soon as possible or on time greatly affects users’ expectations and satisfaction. For these reasons, in order to improve user retention and backend management efficiency, the OFD platform needs to predict the OFCT accurately.
In fact, the estimated time of takeout order falls within the scope of the Estimated Time of Arrival (ETA) problem [
12]. ETA, which refers to the estimation of travel time from origin to destination, has been well studied in logistics and transportation [
13,
14,
15,
16]. However, existing studies of estimated time of takeout order still have the following limitations. First, many studies [
12,
17] treat order information as static, ignoring the fact that the order status actually evolves dynamically. As order fulfillment is progressively completed, the dynamic information associated with it constantly increases [
18]. As a result, the information on completed historical orders is both static and dynamic. The estimation of OFCT for a new order is usually performed before the order is created, when some of its key information is not yet available. This is because the OFCT displayed on the merchant card and detail page of the OFD platform usually exists before the order is created, and some of the order content and rider features are unknown at that time. This is also what makes the estimated time of a takeout order different from, and more difficult than, other ETA problems. Second, while some studies [
11,
18] have achieved effective predictions of OFCT or other order times, there is still a lack of parallel prediction for multi-point time in orders. The OFD platform tends to improve the user experience by providing cross-entity transparency and interoperability through an intelligent takeout system [
19]. Therefore, the parallel prediction of multi-point times will help to improve the management performance of the OFD platform. Third, the temporal inconsistency in order data has not been further analyzed by many existing prediction methods. Temporal inconsistency, i.e., the variation in the number of orders at different times of the day, reflects the fact that user demand fluctuates over time and implies that orders have peak and trough periods [
20].
To address these gaps, this paper uses deep learning to enable multi-point time prediction of food orders before creation, which provides new possibilities for takeout service optimization and backend management. The main contributions of this paper are summarized as follows.
- 1.
This paper proposes a data processing method for time chain simulation. In order to simulate the order fulfillment process, the dynamic and static information involved in the fulfillment of a takeout food order is innovatively integrated and divided into the form of time chains. Specifically, the value of a feature under different statuses is static if it does not change and dynamic if it does. The processes of the time chains contain features under different statuses, which makes their dynamic and static evolution evident. This method also makes the intervals and trends of the sequence data more visible through the principle of simulation, which the subsequent GRU-Transformer can then recognize and capture. Additionally, time chains expand the dimensionality of the data to enable a steady-state segmented simulation, which can increase the scalability of the data to some extent.
- 2.
This paper proposes a GRU-Transformer architecture for sequence-to-sequence parallel prediction. Compared with the traditional Gated Recurrent Unit (GRU) and Transformer individually, the GRU-Transformer architecture combines the strengths of both. As a result, the new architecture effectively captures dependencies in time series while enabling feature enhancement learning. Moreover, thanks to the functionality of the GRU, it has a better ability to perceive the intervals and trends of sequences, which helps to improve its prediction performance.
- 3.
In the takeout scenario, after implementing time chain simulation and the GRU-Transformer architecture, the proposed approach can effectively predict multi-point times before order creation. The experimental results show that the Mean Squared Error (MSE) of the GRU-Transformer with time chain simulation is reduced by about 4.83% compared to the GRU-Transformer alone and by about 9.78% compared to the Transformer. Moreover, the MSE of the GRU-Transformer alone is reduced by about 5.20% compared to the Transformer. Finally, in the analysis of temporal inconsistency, the GRU-Transformer with time chain simulation performs well in peak periods but slightly underperforms in trough periods.
The rest of this paper is organized as follows:
Section 2 shows the related literature review.
Section 3 introduces the method and architecture proposed in this paper in detail.
Section 4 shows the experimental process and result analysis.
Section 5 provides an in-depth discussion. Finally,
Section 6 concludes this paper.
2. Literature Review
This section first reviews studies related to the estimated time of takeout orders, and then elaborates on the development of time series prediction methods.
2.1. Estimated Time of Takeout Order
In recent years, time prediction of takeout order has become a hot topic in the field of e-commerce, attracting the attention of many scholars in algorithm research and machine learning.
Fulfillment-Time-Aware Personalized Ranking is a recommendation method proposed by Wang et al., and the OFCT prediction module mentioned in this method uses the Transformer architecture [
11]. Compared with other models, the convergence and effectiveness of the OFCT prediction module are demonstrated. However, the module predicts the corresponding OFCT given a collection of historical orders and the next-time-step feature vectors, which overlooks the fact that some key information is unavailable before the order is created. After capturing the key features in the takeout fulfillment process, Zhu et al. fed these features into a deep neural network (DNN) and then introduced a new post-processing layer to improve the convergence speed, thereby achieving effective prediction of OFCT [
12]. Moreover, an online A/B test deployed on Ele.me demonstrated that the model reduced the average error in predicting OFCT by 9.8%. However, this approach does not take into account the fact that the order status actually changes dynamically; it simply treats all features as static model inputs. Combined with probabilistic forecasting, Gao et al. proposed a deep learning-based non-parametric method for predicting the food preparation time of takeout orders [
21]. Based on the results of an online A/B test at Meituan, the method reduces the food pick-up waiting time for 2.17~4.57% of couriers. Although the non-parametric approach is more flexible, it can only output a finite number of prediction points and is limited in its ability to characterize the tails or extremes of the distribution. To cope with surging order demand during peak periods, Moghe et al. proposed a new system based on the collaborative work of multiple machine learning algorithms to enhance the estimation of batch order delivery times [
22]. However, in practice, there are still many orders that cannot be batch delivered, and the method lacks a more in-depth consideration of this aspect. For multiple critical times in order fulfillment, Wang et al. present different methods used by Uber Eats to predict food preparation time, travel time, and delivery time, respectively [
18]. But research on parallel prediction for these critical times is still missing. Şahin et al. applied the random forest algorithm in online food delivery service delay prediction to investigate compliance with fast delivery standards [
17]. Since this method converts the prediction problem into a binary classification problem, it is simple and convenient but sacrifices prediction granularity.
Based on these, this paper identifies several research gaps regarding the estimated time of takeout orders. Therefore, the methods proposed in this paper innovatively address these gaps. First, this paper proposes a data processing method based on the concept of simulation. While conforming to the actual situation of order fulfillment, the method can realize the fusion of dynamic and static information in the order. Second, this paper builds a model architecture for parallel prediction in order to achieve effective prediction of multi-point time before order creation.
2.2. Time Series Prediction
In the takeout scenario, the system provides order service according to the “first created, first fulfilled” rule, and the rider resources available for scheduling are often fixed. Therefore, as a whole, after sorting by creation time, the OFCTs of historical orders have an impact on the OFCT of subsequent orders, which means that their OFCTs are correlated. Then, considering that different orders may involve the same entity, after grouping by entity, the OFCTs of the orders within a group tend to have a temporal dependency. For example, orders belonging to the same merchant are usually fulfilled in temporal order, which means that historical orders can affect the food preparation of subsequent orders. As a result, for orders characterized by significant temporality, this paper adopts a time series prediction model to meet the forecasting needs.
With the increasing application of time series prediction, many scholars are constantly committed to model innovation to achieve more effective and accurate predictions. Time series prediction methods can be roughly divided into two categories, namely statistical methods and machine learning methods. First, methods based on traditional statistics mainly include Auto Regressive (AR) [
23], Moving Average (MA) [
24], Autoregressive Moving Average (ARMA) [
25], Autoregressive Integrated Moving Average (ARIMA) [
26], Hidden Markov Model (HMM) [
27], etc. However, the predictive performance of these methods is limited. Second, methods based on machine learning mainly include Support Vector Machine (SVM) [
28], Decision Tree [
29], Random Forest [
30], Feedforward Neural Network (FNN) [
31], Multilayer Perceptron (MLP) [
32], Recurrent Neural Network (RNN) [
33], Long Short-Term Memory (LSTM) [
34], Gated Recurrent Unit (GRU) [
35], etc. Although these methods are more generalizable, there is still room for improvement in predictive accuracy.
Transformer is a more efficient, parallelizable computing architecture proposed by Vaswani et al. in 2017 [
36]. It was originally designed for natural language processing tasks, especially machine translation. Inspired by the idea of the Transformer, scholars have improved the architecture and derived a series of Transformer-based methods for better application in time series prediction. By reversing the roles of the Attention Mechanism and the Feed-Forward Network, Liu et al. proposed iTransformer to achieve better prediction performance [
37]. However, when the feature variables are low dimensional, the predictive ability of iTransformer is still insufficient. Kitaev et al. proposed a Reformer with locality sensitive hashing (LSH) attention and reversible residual layers, which improves memory-efficiency and performs better on long sequence [
38]. But its advantage in short sequence prediction is not obvious. To compensate for the cross-dimension dependency of the Transformer, Zhang et al. proposed a Crossformer with Dimension-Segment-Wise (DSW) embedding and Two-Stage Attention (TSA) layer [
39]. However, the predictive ability of the Crossformer is limited to small-scale data. Tong et al. designed a multiscale residual sparse attention model RSMformer based on the Transformer architecture, and it performed well in long-sequence time series prediction tasks [
40]. But RSMformer performs slightly less well on non-stationary time series that lack significant periodicity. Informer is an architecture proposed by Zhou et al. by designing the ProbSparse self-attention mechanism and refining operation, which solves the problems of Transformer’s quadratic time complexity and quadratic memory usage [
41]. However, for shorter horizons, Informer’s prediction performance is not stable. Inspired by stochastic process theory, Autoformer is a new decomposition architecture with an autocorrelation mechanism based on sequence periodicity proposed by Wu et al. [
42]. But its generalization ability on non-stationary data is still limited. The Block-Recurrent Transformer designed by Hutchins et al. applies transformer layers recursively along the sequence, and it utilizes parallel computing within the block to effectively utilize the accelerator hardware [
43]. However, when its state space is too large, the Block-Recurrent Transformer struggles to effectively play its recurrent role.
Accordingly, after summarizing the shortcomings of the above studies, this paper chooses to combine the advantages of GRU and Transformer with each other, so as to propose a new architecture to achieve time series prediction in takeout scenarios.
4. Experiments and Results
In this section, this article demonstrates the effectiveness of time chain simulation and GRU-Transformer in parallel prediction of multi-point time through detailed experiments.
4.1. Dataset
The real-world takeout food order data from 6 July 2023 to 13 July 2023 is used in the experiments in this paper (data sourced from the OFD platform: Life Plus).
Life Plus is a local life digital comprehensive service platform with the takeout business at its core, serving a diversified county economy. To date, the platform has a user base of 1.08 million, over 25 million service orders, and an annual turnover of over CNY 210 million (Renminbi) [
49]. Its on-demand food delivery business operates mainly in Guizhou Province, China.
4.2. Data Preprocessing
Firstly, this paper filters the takeout food orders and retains only the data whose order status is completed. Secondly, in a few special situations, users go to the restaurant to pick up the takeout food themselves, meaning these orders do not need to be delivered by riders. These records are therefore excluded, as they do not qualify as delivery orders.
In addition, takeout orders can be categorized into advance orders and normal orders. The delivery times of advance orders are set by the users, meaning they are not instantaneous deliveries. A statistical check finds that only 112 advance orders have a Rider Delivery Time Taken greater than 1.5 h (i.e., >5400 s), accounting for about 0.28% of all valid orders. Because this proportion is very small, excluding the advance orders with delivery time >5400 s will not have a large impact on subsequent calculations. After screening for outliers, a total of 39,735 orders were used for subsequent experiments, which involved 1404 merchants and 319 riders.
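The preprocessing filters described above can be sketched as follows; the field names are assumptions for illustration, not the platform's actual schema:

```python
# Sketch of the preprocessing filters (field names are illustrative assumptions).
def keep_order(order):
    if order["status"] != "completed":
        return False                 # keep only completed orders
    if order["self_pickup"]:
        return False                 # user self-pickup: not a delivery order
    if order["is_advance"] and order["delivery_time_taken"] > 5400:
        return False                 # advance-order outliers (> 1.5 h)
    return True

orders = [
    {"status": "completed", "self_pickup": False, "is_advance": False, "delivery_time_taken": 1500},
    {"status": "completed", "self_pickup": True,  "is_advance": False, "delivery_time_taken": 900},
    {"status": "completed", "self_pickup": False, "is_advance": True,  "delivery_time_taken": 7200},
]
valid = [o for o in orders if keep_order(o)]
print(len(valid))  # 1: only the first order survives all three filters
```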
Then, in order to show that there is a correlation between the OFCTs in the overall order data, this paper conducts the Ljung–Box test (at significance level α = 0.05) on the OFCT series (i.e., the Rider Delivery Time Taken) after sorting the orders by creation time, and the results are shown in
Table 2.
The Ljung–Box test is a method for testing the overall autocorrelation of a time series over multiple lags [
50]. Its null hypothesis (H0) is as follows: there is no autocorrelation in the first m lags of the time series. Its Q-statistic is Q = n(n + 2) Σ_{k=1}^{m} ρ̂_k² / (n − k), which asymptotically follows a χ²(m) distribution, where n is the series length and ρ̂_k is the sample autocorrelation at lag k. As can be seen from
Table 2, the p-values for both the first 10 lags and the first 20 lags are less than 0.05 at the significance level α = 0.05, which indicates that the Ljung–Box test rejects the null hypothesis. This implies that there is significant autocorrelation in the OFCT series, thus indicating that the OFCTs of historical orders have an effect on the OFCT of subsequent orders.
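The Q-statistic above can be computed directly; the following is a minimal pure-Python sketch (in practice one would use `statsmodels`' `acorr_ljungbox`), with a toy periodic series standing in for the OFCT data:

```python
# Minimal sketch of the Ljung-Box Q statistic:
# Q = n(n+2) * sum_{k=1..m} rho_k^2 / (n - k), asymptotically chi-squared(m).
def ljung_box_q(x, m):
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)          # sum of squared deviations
    q = 0.0
    for k in range(1, m + 1):
        # sample autocorrelation at lag k
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

# A strongly autocorrelated toy series: Q far exceeds the chi-squared(10)
# critical value of 18.31 at alpha = 0.05, so H0 would be rejected.
series = [i % 10 for i in range(200)]
print(ljung_box_q(series, m=10))
```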
Finally, the trend changes in the number of orders for the week of 6 July 2023 to 12 July 2023 are plotted in
Figure 4. From
Figure 4a, it can be observed that the number of orders is high on 8 and 9 July 2023, which indicates that the number of takeout food orders increases on weekends. From
Figure 4b, it can be observed that the number of orders is high in the 12th, 13th, 18th, and 19th hour of a day, which indicates that the 12th, 13th, 18th, and 19th hour of the day are the peak periods for takeout food orders.
4.3. Feature Extraction and Time Chain Simulation
During the fulfillment process of a takeout order, many features are generated. This paper extracts the following features from the perspectives of the order, the merchant, and the rider.
- 1.
Order time features: hour of the day, weekend, meal period, and peak period. Among them, meal periods are divided into breakfast, lunch, afternoon tea, dinner, midnight snack, and other.
- 2.
Order food features: order price, food quantity, and average food preparation time.
- 3.
Merchant features: merchant identifier (i.e., merchant ID), distance from merchant to user, the average number of effective orders a month for the merchant, the current number of uncompleted orders of the merchant (MU), the current number of orders on the way for the merchant (MD), and the merchant delivery ratio (MDR). The merchant delivery ratio is calculated as MDR = MD / MU, and to prevent the denominator from being 0, MU is set to 1 when MU = 0.
The average number of effective orders a month for the merchant specifically refers to the average number of effective orders completed by the merchant per day in June 2023, which is used to measure the delivery capability of the merchant.
- 4.
Rider features: rider identifier (i.e., rider ID), the distance that the rider currently needs to travel to the merchant (RTM), the distance that the rider currently needs to travel to the user (RTU), cycling tool, rider level, the current number of uncompleted orders of the rider (RU), the current number of orders on the way for the rider (RD), the rider delivery ratio (RDR), and the number of orders taken by the rider from the same merchant. The rider delivery ratio is calculated as RDR = RD / RU, and to prevent the denominator from being 0, RU is set to 1 when RU = 0.
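A small sketch of the delivery-ratio features with the zero-denominator guard; note that the exact ratio definition used here (on-the-way orders divided by uncompleted orders, with the denominator floored at 1) is an assumption for illustration:

```python
# Sketch of the MDR/RDR-style delivery ratio with the zero-denominator guard.
# The exact ratio definition is an assumption: on-the-way / uncompleted,
# with the denominator floored at 1 to avoid division by zero.
def delivery_ratio(on_the_way, uncompleted):
    return on_the_way / max(uncompleted, 1)

print(delivery_ratio(3, 5))   # MDR-style: 3 of 5 uncompleted orders are on the way
print(delivery_ratio(0, 0))   # guard prevents division by zero
```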
Combined with
Table 1, it can be seen that in the S0 status, the rider has not accepted the order and does not need to go anywhere at this time, so
RTM =
RTU = 0. In the S1 status, the rider only needs to go to the merchant. The
RTM corresponds to the distance between the rider and the merchant at the time of accepting the order, and
RTU = 0. In the S2 status, the rider has arrived at the restaurant and is waiting to pick up the food. There is no need to go anywhere at this time, so
RTM =
RTU = 0. In the S3 status, the rider has picked up the food and only needs to go from the merchant to the user. So, at this point,
RTM = 0, and the
RTU corresponds to the distance between the rider and the user at the time of picking up the food.
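The status-dependent RTM/RTU rules described above can be sketched as a small lookup function (the distance inputs are illustrative):

```python
# Sketch of the status-dependent RTM/RTU rules from the text:
# S0 and S2 involve no travel target, S1 targets the merchant, S3 targets the user.
def rtm_rtu(status, dist_rider_to_merchant, dist_merchant_to_user):
    if status == "S1":      # heading to the merchant after accepting the order
        return dist_rider_to_merchant, 0.0
    if status == "S3":      # heading to the user after picking up the food
        return 0.0, dist_merchant_to_user
    return 0.0, 0.0         # S0 (deciding) and S2 (waiting at the restaurant)

print(rtm_rtu("S1", 1.2, 2.4))  # (1.2, 0.0)
print(rtm_rtu("S3", 1.2, 2.4))  # (0.0, 2.4)
```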
In addition, in the first time chain, the rider’s features are unknown, since this time chain represents the stage at which the rider has not yet accepted the order. Specifically, the values of the rider’s features in the fourth process under the S0 status are set to 0. Finally, the above feature variables are concatenated in the form of the time chain processes in
Figure 2. Among them, each process has all the feature variables, but the values of the feature variables are different in different statuses.
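As a toy illustration of this concatenation, the following sketch builds one order's time chain, where static feature values repeat across processes and dynamic values take status-specific values (feature names and values are hypothetical):

```python
# Illustrative sketch of the time chain idea: each order is expanded into one
# process per status S0-S3; static features keep their value across processes,
# dynamic features take status-specific values. Names/values are hypothetical.
STATUSES = ["S0", "S1", "S2", "S3"]

def build_time_chain(static_feats, dynamic_feats):
    """static_feats: dict of unchanging values.
    dynamic_feats: dict mapping feature name -> {status: value}."""
    chain = []
    for s in STATUSES:
        process = dict(static_feats)              # static values repeat
        for name, per_status in dynamic_feats.items():
            process[name] = per_status[s]         # dynamic values evolve
        chain.append(process)
    return chain

chain = build_time_chain(
    {"order_price": 32.0, "distance_to_user": 2.4},
    {"RTM": {"S0": 0.0, "S1": 1.1, "S2": 0.0, "S3": 0.0},
     "RTU": {"S0": 0.0, "S1": 0.0, "S2": 0.0, "S3": 2.4}},
)
print(len(chain))  # 4 processes, one per status
```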
4.4. Experimental Details
In this paper, considering that orders are affected differently among different merchants, the order data from 6 July 2023 to 13 July 2023 are firstly grouped by merchant ID. Secondly, the data are organized into time chains and combined with time window sliding to generate samples with corresponding input sequences, target sequences, and prediction sequences. Then, the samples from different groups are combined. Finally, the samples whose time falls within the last 24 h are used as the test set, and the remaining samples before the last 24 h are randomly divided into training and validation sets at a ratio of 8:2.
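The grouping and window-sliding step can be sketched as follows; the window lengths and the toy per-merchant sequences are hypothetical, since the paper's exact lengths are not restated here:

```python
# Sketch of grouping by merchant and sliding a time window to build samples.
# Window sizes and the toy sequences are illustrative assumptions.
def make_samples(series, input_len, target_len):
    samples = []
    for start in range(len(series) - input_len - target_len + 1):
        x = series[start:start + input_len]                           # input sequence
        y = series[start + input_len:start + input_len + target_len]  # target sequence
        samples.append((x, y))
    return samples

grouped = {"merchant_a": list(range(10)), "merchant_b": list(range(100, 108))}
samples = []
for chain_seq in grouped.values():   # one sequence of time chains per merchant
    samples.extend(make_samples(chain_seq, input_len=4, target_len=2))
print(len(samples))  # 5 samples from merchant_a + 3 from merchant_b = 8
```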
Regarding the experimental setup, this paper uses Mean Squared Error (MSE) Loss as the loss function, Adam as the optimizer, and a learning rate of 0.0001. MSE Loss is chosen because it emphasizes reducing larger prediction errors, making it suitable for the prediction task at hand. Adam is an optimization algorithm that adaptively adjusts the learning rate based on first- and second-moment estimates of the gradients. This flexibility makes Adam suitable for optimizing heterogeneous models such as the Transformer, which has a non-sequential stack of disparate parameter blocks [
51]. Then, considering that the different modules of GRU-Transformer are stacked non-sequentially, making it similarly heterogeneous to the Transformer, we choose Adam for model training. Moreover, setting the initial learning rate to 0.0001 is consistent with [
37,
52], which suggest that a lower learning rate contributes to stable convergence and finer tuning for GRU-Transformer. The loss on the validation set is used as the monitoring metric (i.e., Validation Loss), and the model is iteratively trained for 300 epochs. The model that minimizes the Validation Loss is selected as the optimal model. Moreover, all experiments are performed on a single NVIDIA GeForce RTX 2080 Ti.
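The model-selection rule (keep the checkpoint with the minimum Validation Loss over the fixed training budget) can be sketched as below; the loss values are dummy placeholders:

```python
# Sketch of the model-selection rule: train for a fixed number of epochs and
# keep the epoch with the minimum Validation Loss (losses are dummy values).
def select_best_epoch(val_losses):
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # a checkpoint would be saved here
    return best_epoch, best_loss

losses = [0.9, 0.5, 0.42, 0.44, 0.41, 0.46]
print(select_best_epoch(losses))  # epoch 5 with loss 0.41
```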
To evaluate the effectiveness of time chain simulation, this paper notates the GRU-Transformer that incorporates time chain simulation sequences for the prediction task as GRU-Transformer-withTC, contrasting it with the standard GRU-Transformer without time chain simulation. The key parameters for GRU-Transformer are the number of GRU layers
N and the number of Cross Attention heads
H. The value of
N determines the ability of the GRU-Transformer to capture sequence dependencies, but too many GRU layers may cause the gradient to vanish or explode. Similarly, the value of
H affects the GRU-Transformer’s focus on important features, but too many Cross Attention heads may lead to overfitting or loss of focus. Therefore, this paper defines the selection space for
N as [2, 3, 4] and for
H as [2, 4, 8]. The minimum Validation Loss is used as the evaluation metric, and grid search is employed to find the optimal combination of these parameters. The specific results are shown in
Figure 5.
As shown in
Figure 5, the results from the grid search are visualized using heat maps, with evaluation metric values represented as the corresponding heat values. The key parameters corresponding to the grid cell with the smallest Validation Loss value are considered as the optimal combination. Specifically, the optimal combination of key parameters for GRU-Transformer is (
N = 3,
H = 4), i.e., the number of GRU layers is 3 and the number of Cross Attention heads is 4. Moreover, the comparison shows that the Validation Losses for GRU-Transformer-withTC are all smaller than GRU-Transformer. In summary, the search results in
Figure 5 intuitively reflect the model performance for GRU-Transformer with different combinations of key parameters, thus contributing to finding the best combination (
N = 3,
H = 4).
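The grid search over the two key parameters can be sketched as follows; the evaluation function here is a toy stand-in for training a model and reading its Validation Loss, built so that its minimum mimics the reported optimum (N = 3, H = 4):

```python
# Sketch of the grid search over N (GRU layers) and H (Cross Attention heads).
# eval_fn is a stand-in for "train and return Validation Loss".
from itertools import product

def grid_search(eval_fn, n_space=(2, 3, 4), h_space=(2, 4, 8)):
    # evaluate every (N, H) cell and return the one with the smallest loss
    return min(product(n_space, h_space), key=lambda nh: eval_fn(*nh))

# Toy loss surface whose unique minimum sits at N=3, H=4.
toy_loss = lambda n, h: (n - 3) ** 2 + (h - 4) ** 2 / 16
print(grid_search(toy_loss))  # (3, 4)
```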
4.5. Results and Analysis
In this section, in order to fully demonstrate the performance of GRU-Transformer and the effectiveness of the time chain simulation, the paper analyzes the comparison results, the results of temporal inconsistency, and the results of parameter sensitivity, respectively.
4.5.1. Comparative Analysis
First, multiple models are used for comparative analysis. The training iteration process of the models belonging to the neural network is shown in
Figure 6. The evaluation results for all models are specified in
Table 3.
- 1.
Module ablation: Transformer [
36] and LSTM-Transformer are compared to demonstrate the role of the GRUs module in GRU-Transformer. Among them, LSTM-Transformer refers to the model obtained by replacing GRUs with LSTMs in GRU-Transformer. Additionally, GRU-Transformer-withTC using time chain sequences is used for comparison to demonstrate the impact of time chain simulation.
- 2.
Transformer-based models: some highly efficient models are used for comparison to demonstrate the effectiveness of GRU-Transformer, such as iTransformer, Reformer, and Crossformer. Among them, iTransformer achieves higher prediction performance than Transformer through inverted attention and Feed-Forward Networks, and its stronger generalization ability is suitable for solving the problem at hand [
37]. Reformer achieves comparable performance to Transformer through locality-sensitive hashing of attention and reversible residual layers, and it is more memory-efficient, which makes it suitable for solving the problem at hand [
38]. With the ability to capture both cross-time and cross-dimension dependency, Crossformer is suitable for current multivariate time series forecasting [
39].
- 3.
Other architecture-based models: some state-of-the-art models are used for comparison to demonstrate the architecture performance of GRU-Transformer, such as Mamba and Temporal Convolutional Network (TCN). Among them, Mamba not only has linear scaling capability for sequence length, but also has a simpler architecture [
52]. Moreover, Mamba performs significantly well in handling computations between relevant variables, which makes it suitable for solving the problem at hand. TCN is a unique architecture with causal convolutions and dilated convolutions and it performs well in the task of multiple time series prediction [
53]. Moreover, TCN is able to capture long-term dependencies more effectively using cross-sequence modeling, which makes it suitable for solving the problem at hand.
- 4.
Gradient boosting-based models: in order to demonstrate the superiority of the GRU-Transformer architecture, which belongs to the neural network, gradient boosting models that have good performance and belong to non-neural network are used for comparison, such as Gradient Boosting Decision Tree (GBDT) and eXtreme Gradient Boosting (XGBoost). Among them, GBDT has good robustness and flexibility [
54], XGBoost has good prediction accuracy and can be trained quickly [
55].
In
Table 3, R-squared (R2), Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) are used as evaluation indicators for the test set. The specific calculation formulae are as follows:
R2 = 1 − (Σ_{g=1}^{G} (y_g − ŷ_g)²) / (Σ_{g=1}^{G} (y_g − ȳ)²),
MSE = (1/G) Σ_{g=1}^{G} (y_g − ŷ_g)²,
MAE = (1/G) Σ_{g=1}^{G} |y_g − ŷ_g|,
RMSE = √MSE,
where y_g is the g-th true value, ŷ_g is the g-th predicted value, ȳ is the mean of the true values, and G is the number of test samples.
Combining
Figure 6 and
Table 3, it can be seen that GRU-Transformer-withTC converges faster and predicts better than GRU-Transformer. Specifically, the Validation Loss of GRU-Transformer-withTC has reached the minimum at epoch = 246. Its
MSE is reduced by about 4.83% compared to GRU-Transformer and 9.78% compared to Transformer. This shows that the feature changes and process advancement included in the time chains allow the sequence data to show the intervals and trends more prominently, which helps the GRUs module in GRU-Transformer capture temporal dependencies between sequences. Thus, time chain simulation can increase the scalability of the data and assist the GRU-Transformer in recognizing and integrating the evolution of dynamic and static information more easily. Moreover, the
MSE of GRU-Transformer is reduced by about 5.20% compared to Transformer, which further proves that the GRUs module has a significant effect on improving performance. And, according to the prediction results of LSTM-Transformer, it can be seen that GRU-Transformer performs better. This is due to the fact that GRU has a simpler and more computationally efficient structure than LSTM. Then, GRU-Transformer also has better prediction performance compared to other Transformer-based models, which have about 2.3 times the total params of Transformer, but much less computational effort than Crossformer. In addition, although Mamba and TCN have faster convergence and fewer total params than GRU-Transformer, they have higher prediction errors. Specifically, the
MSE of GRU-Transformer is reduced by about 18.70% compared to Mamba and 10.26% compared to TCN. This may be because Mamba struggles to capture complex time series patterns. As for TCN, despite enlarging its receptive field with dilated convolutions, it still faces limitations in this regard. Finally, although GBDT and XGBoost have very low computational costs, their prediction accuracy is lower compared to GRU-Transformer. This also means that the predictive ability of gradient boosting-based models is still limited. In summary, both prediction accuracy and convergence speed indicate that GRU-Transformer-withTC is the most effective model. Although it has more parameters due to the time chain simulation, this design choice results in superior prediction performance compared to other models.
Then, this paper also compares the computational overhead of GRU-Transformer and Transformer, as shown in
Table 4.
Based on the time complexity $O(n^2 \cdot d)$ of Transformer [36], the key difference with GRU-Transformer lies in the addition of the GRUs module. Specifically, the time complexity of GRU-Transformer becomes $O(n^2 \cdot d + n \cdot d^2)$, where $n$ is the sequence length and $d$ is the dimension. The increased GPU memory usage of GRU-Transformer, compared to Transformer, is primarily due to the larger number of parameters in the GRUs module. However, GRUs run fast on shorter sequences, which enables GRU-Transformer to utilize the GPU for less time than Transformer. This suggests that GRU-Transformer not only efficiently captures the temporal dependencies in sequences, but also focuses on the most important features in the sequences, thereby improving its prediction performance.
Finally, in order to analyze the model performance in depth, this paper calculates the Absolute Error (i.e., $\left|y_g - \hat{y}_g\right|$) of all prediction results of GRU-Transformer-withTC, GRU-Transformer, and Transformer, respectively (denoted here as $AE_{withTC}$, $AE_{GT}$, and $AE_{T}$). Then, $AE_{withTC}$ and $AE_{GT}$ are taken as one paired-sample, $AE_{GT}$ and $AE_{T}$ as another, and the differences within the paired-samples are calculated separately, i.e., $D_1 = AE_{withTC} - AE_{GT}$ and $D_2 = AE_{GT} - AE_{T}$. For the normality of $D_1$ and $D_2$, since the test set is a large sample, the Anderson–Darling test [56] adapted to them is used here. However, the test results show that neither $D_1$ nor $D_2$ follows a normal distribution at the significance level ($\alpha = 0.05$). Therefore, this paper further uses the Wilcoxon test [57], a nonparametric method for determining whether differences in model performance are significant. Specifically, two-sided Wilcoxon tests are conducted for each of the two paired-samples, and their null hypothesis (H
0) is that the model performance within the paired-sample is similar. The results are shown in
Table 5.
As can be seen in
Table 5, the
p-values for both paired-samples are less than 0.05, which means that both Wilcoxon tests reject the null hypothesis. Since the performance difference between GRU-Transformer-withTC and GRU-Transformer is significant, this again demonstrates the effectiveness of the time chain simulation. Since the performance difference between GRU-Transformer and Transformer is also significant, this again proves the importance of the GRUs module. Accordingly, combining all the above comparative analyses, the superiority of GRU-Transformer-withTC over other models is fully demonstrated.
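The paired comparison above can be sketched in pure Python using the large-sample normal approximation of the two-sided Wilcoxon signed-rank test (in practice a library routine such as `scipy.stats.wilcoxon` would typically be used; the arrays below are illustrative placeholders, not the paper's absolute errors):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, zero diffs dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks for ties
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

A p-value below 0.05 rejects the null hypothesis that the two paired error samples come from the same distribution.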
4.5.2. Temporal Inconsistency Analysis
Considering that the prediction sequences contain the prediction results of four time points, their corresponding
RMSE values are calculated in this paper to evaluate the prediction error of GRU-Transformer-withTC at each hour, as shown in
Figure 7.
Given the temporal inconsistency in the order data, this paper calculates the prediction error (
RMSE) of GRU-Transformer-withTC at different hours, as shown in
Figure 7a. Among them, some
RMSE values are 0 because there are no order data in the test set at the 3rd to 7th hours of the day. Moreover,
RMSE values are higher in the 1st, 2nd, 23rd, and 24th hours of the day, which, in combination with
Figure 4b shows that these hours are trough periods for takeout. Trough periods refer to the times of day, such as early morning hours, when there are significantly fewer takeout orders. During these hours, demand for takeout food is low, resulting in fewer order samples. Additionally, with a very limited number of active riders in trough periods, there is a noticeable drop in delivery labor across the takeout system. This also means that the rider may be farther away from the merchant when accepting the order, and nighttime road conditions can further complicate the delivery. As a result, these challenges impact order fulfillment during trough periods, which in turn affects the prediction performance of GRU-Transformer-withTC at these times. Nevertheless,
RMSE values of GRU-Transformer-withTC at all other hours are less than 10 min. It is worth noting that takeout orders increase significantly during peak periods, when demand is higher, such as at lunch time and dinner time. Specifically, peak periods correspond to the 12th, 13th, 18th, and 19th hours of the day. As the overall takeout system becomes busier during peak periods, it is also more difficult to accurately predict the multi-point time of food orders. Therefore, the stable performance of GRU-Transformer-withTC in peak periods is valuable for practical applications.
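The hourly error breakdown described above can be reproduced with a short sketch that groups (hour, true, predicted) records by hour of day and computes an RMSE per hour (the records below are illustrative placeholders, not the paper's data):

```python
import math
from collections import defaultdict

def rmse_by_hour(records):
    """records: iterable of (hour_of_day, y_true, y_pred); returns {hour: RMSE}.
    Hours with no orders (e.g., the 3rd to 7th hours) simply do not appear."""
    buckets = defaultdict(list)
    for hour, y_true, y_pred in records:
        buckets[hour].append((y_true - y_pred) ** 2)
    return {h: math.sqrt(sum(sq) / len(sq)) for h, sq in sorted(buckets.items())}

# Illustrative records only (times in seconds), not the paper's data
records = [(12, 600.0, 590.0), (12, 540.0, 560.0), (1, 900.0, 700.0)]
per_hour = rmse_by_hour(records)
```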
Figure 7b illustrates the distribution of the hourly
RMSE for each of the four predicted time points using box plots, where the outliers correspond to the larger prediction errors during the trough periods. Among them, one outlier corresponds to the
RMSE in the 2nd hour of the day. Based on the actual takeout scenario, it is related to the complex situation of order fulfillment in the early morning, which leads to a larger prediction error. In addition, since the true values of the rider's order accepting time are inherently smaller, the
RMSE of its predicted results is generally smaller. The
RMSE for the rider's arrival time fluctuates over a wide range, which is related to the road conditions and the distance from the rider to the merchant at the moment the rider accepts the order. In summary, based on the
RMSE patterns during trough periods, it is clear that special nighttime deliveries require more factors to be taken into account, such as the number of active riders and road conditions. These factors may help the model recognize complex situations, further improving the prediction performance during trough periods.
In order to further highlight the advantages and limitations of GRU-Transformer-withTC, this paper compares the prediction errors of other models for multi-point time at different periods. The specific results are shown in
Figure 8 and
Figure 9, respectively.
As can be seen from
Figure 8, the prediction error of GRU-Transformer-withTC is generally lower in peak periods, and its performance advantage is more significant for some of the predicted time points. This shows that the prediction performance of GRU-Transformer-withTC is stable for multi-point time with different magnitudes of time taken values. It also means that GRU-Transformer-withTC successfully recognizes the order fulfillment pattern in the busy state of the takeout system. Comparatively, the performance of Reformer and Crossformer in peak periods is less stable.
Then, as can be seen from
Figure 9, GRU-Transformer occasionally outperforms GRU-Transformer-withTC during trough periods, such as at the 2nd hour of the day. This suggests that the time chain simulation is somewhat less effective in the complex early morning conditions. Even so, GRU-Transformer alone still performs better than LSTM-Transformer and Transformer most of the time.
4.5.3. Sensitivity Analysis
Finally, this paper performs a sensitivity analysis for the key parameters in GRU-Transformer-withTC, i.e., the prediction error of multi-point time for different combinations of GRU layers
N and Cross Attention heads
H is shown in
Table 6.
According to
Table 6, it can be seen that across different combinations of
N and
H, the
RMSE of GRU-Transformer-withTC for predicting multi-point time does not change much. This indicates that GRU-Transformer-withTC has good robustness. Moreover, the optimal combination (
N = 3,
H = 4) yields overall
RMSE values of 159.2017, 450.9884, 486.9597, and 507.3490 for the four predicted time points, respectively, where the last is the Order Fulfillment Cycle Time, and its overall error is not more than 8.5 min.
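A sensitivity sweep of this kind amounts to a simple grid search over (N, H) combinations. The sketch below illustrates the pattern; `train_and_eval` is a hypothetical stand-in that returns a dummy deterministic RMSE, not the paper's training code:

```python
import itertools

def train_and_eval(n_layers, n_heads):
    """Hypothetical stand-in: train the model with the given GRU layer count
    and Cross Attention head count, then return a validation RMSE.
    The formula below is a dummy score for illustration only."""
    return abs(n_layers - 3) * 10 + abs(n_heads - 4) * 5 + 450.0

def grid_search(layer_options, head_options):
    """Evaluate every (N, H) combination and return the best one by RMSE."""
    results = {
        (n, h): train_and_eval(n, h)
        for n, h in itertools.product(layer_options, head_options)
    }
    best = min(results, key=results.get)
    return best, results

best, results = grid_search([1, 2, 3, 4], [2, 4, 8])
```

In a real run, `train_and_eval` would retrain the model per combination, so caching or early stopping is worthwhile when the grid is large.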
5. Discussion
Our research addresses the problem of parallel prediction of multi-point time in takeout scenarios, where some important information is unknown before order creation. In this paper, we first propose a time chain simulation approach that achieves a steady-state segmented simulation of the order fulfillment process through the evolution of dynamic and static information, thus allowing the temporal dependencies between sequences to be captured more easily. Then, combining the respective strengths of GRU and Transformer, the GRU-Transformer architecture is designed to better perceive the intervals and trends of sequences, enabling more efficient parallel prediction. According to the results of extensive experiments, GRU-Transformer with time chain simulation has the best comprehensive performance. At the same time, we have gained some meaningful insights, as outlined below.
First, time chain simulation is able to reproduce the status of order fulfillment through changes in the data. Based on the progressive structure of segmented processes, this method expands the feature dimensions so that the variation in dynamic and static information is more obvious. The MSE of the prediction results of GRU-Transformer with time chain simulation is reduced by about 4.83% compared to GRU-Transformer using the original feature sequences. Moreover, the Wilcoxon test results for GRU-Transformer with time chain simulation and GRU-Transformer confirmed the significance of the performance difference. As a result, all these experiments validate the critical role of time chain simulation in fusing dynamic and static information in sequences.
Second, the GRU-Transformer architecture achieves a significant improvement in prediction performance compared to other models. Specifically, by combining the GRU and Transformer, the architecture reduces the MSE of the prediction results by about 5.20% compared to the Transformer. Moreover, the results of the Wilcoxon test for their paired sample indicate a significant difference in performance between the two. All these results show that the simple GRUs module provides an obvious benefit to GRU-Transformer in terms of improving its ability to capture temporal dependencies. In addition, it can also work in conjunction with the time chain simulation to achieve effective parallel prediction of multi-point order time.
Third, the temporal inconsistency in the order data not only reflects the changes in user demand at different times, but also reveals the model performance on unbalanced samples. In particular, the takeout system is in a busy state during peak periods, and a stable GRU-Transformer with time chain simulation is especially useful at these times. During trough periods, although complex early morning situations can reduce the simulation effectiveness, this also means that techniques for balancing the sample distribution have great potential in this regard.
In summary, the time chain simulation and GRU-Transformer proposed in this paper achieve a significant improvement in the prediction accuracy of multi-point time before order creation. This not only has theoretical implications for improving model architectures for parallel prediction, but also has practical implications for the application of intelligent takeout systems.
6. Conclusions
This paper introduces a data processing method, time chain simulation, that integrates dynamic and static information, and proposes a GRU-Transformer architecture to effectively predict the multi-point time of takeout food orders. The architecture uses an Encoder–Decoder as its main structure and incorporates the advantages of the GRU and Transformer, enabling it both to capture dependency relationships and to focus on important features in the sequences. In addition, experiments are conducted using data from real-world takeout food orders. The experimental results show that GRU-Transformer performs well in the multi-point time prediction task, and that data processed in the form of time chains make it easier to capture the evolution of dynamic and static information. However, when dealing with the temporal inconsistency of the data, GRU-Transformer shows less stable performance during trough periods, which suggests there is room for improvement in its prediction accuracy at these times. To address this, future research can incorporate over-sampling or other data augmentation methods, which balance the sample distribution by adding samples from underrepresented categories. This approach could help mitigate the temporal inconsistency of the data, ultimately improving the model's ability to recognize and adapt to trough periods.
Finally, the multi-point time predicted by GRU-Transformer with time chain simulation can help the OFD platform control takeout food order fulfillment, and can also be applied in the order allocation system to select the right rider for each order, thus improving overall operational efficiency. Among these predictions, the Rider Delivery Time Taken can be displayed on merchant cards and detail pages in the OFD platform, which helps improve the user retention rate. Moreover, the predicted Rider Arrival Time Taken and Rider Pickup Time Taken can help riders plan their delivery schedules. Furthermore, when applying GRU-Transformer with time chain simulation to the ETA problem in other domains, fine-tuning will be necessary. For instance, in online car-hailing services, passenger locations and traffic conditions vary across different times, leading to temporal inconsistency in the data. To address this, data augmentation techniques should be incorporated to mitigate the inconsistency and balance the data distribution. Additionally, new feature engineering should be applied to capture status changes in time chains under the new domain. Lastly, the Sequence Mask or other inner architectures in GRU-Transformer should be modified to better suit the domain's needs, ensuring proper masking of future information within sequences.