Mixed-Graph Neural Network for Traffic Flow Prediction by Capturing Dynamic Spatiotemporal Correlations
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article presents a traffic flow prediction model based on a mixed graph neural network combined with an attention mechanism, capable of capturing both spatial and temporal dependencies. Experiments conducted on open datasets gathered from real roads demonstrate that the proposed model outperforms existing methods, particularly in long-term forecasting.
The introduction clearly articulates the novelty and contribution to the research field. Notably, the inclusion of an algorithm complexity analysis — an uncommon but valuable element — stands out as a strong aspect of the work. The experimental design is also of a high standard, with the authors accounting for a broad range of potential factors influencing the method’s performance. As a result, the conducted experiments provide a comprehensive evaluation of the proposed approach. A comparative analysis with similar algorithms is performed, and both the problem formulation and the introduction adhere to established academic standards. Open datasets are employed for evaluation, ensuring the transparency and reproducibility of results, and allowing the authors to substantiate claims regarding the superiority of their method over existing approaches.
The list of references is fully aligned with the article's content and reflects the current state of research in the domain.
Author Response
Comments 1: The article presents a traffic flow prediction model based on a mixed graph neural network combined with an attention mechanism, capable of capturing both spatial and temporal dependencies. Experiments conducted on open datasets gathered from real roads demonstrate that the proposed model outperforms existing methods, particularly in long-term forecasting.
The introduction clearly articulates the novelty and contribution to the research field. Notably, the inclusion of an algorithm complexity analysis — an uncommon but valuable element — stands out as a strong aspect of the work. The experimental design is also of a high standard, with the authors accounting for a broad range of potential factors influencing the method’s performance. As a result, the conducted experiments provide a comprehensive evaluation of the proposed approach. A comparative analysis with similar algorithms is performed, and both the problem formulation and the introduction adhere to established academic standards. Open datasets are employed for evaluation, ensuring the transparency and reproducibility of results, and allowing the authors to substantiate claims regarding the superiority of their method over existing approaches.
The list of references is fully aligned with the article's content and reflects the current state of research in the domain.
Response 1: Thank you for approving this paper.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
Generally, the paper focuses on the method more than the application. The paper should provide a more in-depth discussion of a real application of the model.
Also, there is research that predicts traffic volumes in a rural context, such as "Estimating Traffic Volume on Minor Roads at Rural Stop-Controlled Intersections using Deep Learning".
The contributions between lines 88 and 98 should be rewritten to reflect real contributions. For example, I do not consider the third point a contribution. The model works on 4 datasets from the same area. The real contribution is the model's accuracy or the model structure. Also, I feel points 1 and 2 fall within the model-structure contribution.
Line 371: what is meant by the first 60% of the data and the last 40%? How was the data sorted? This point should be clarified.
Line 386 and before/after: although the authors tried to justify the selection of the hyperparameters, this makes the model very specific to this dataset, and it may not work on a different dataset or in a different context (e.g., urban only).
Will the traffic characteristics, i.e., the posted speed of the highways, have an impact on the model?
Comments on the Quality of English Language
The manuscript has many typos, such as starting a sentence with a digit (it should start with "eight", not "8").
There are reference errors.
Author Response
Comments 1: Generally, the paper focuses on the method more than the application. The paper should provide a more in-depth discussion of a real application of the model.
Also, there is research that predicts traffic volumes in a rural context, such as "Estimating Traffic Volume on Minor Roads at Rural Stop-Controlled Intersections using Deep Learning".
Response 1: Thanks a lot for the comment. The paper "Estimating Traffic Volume on Minor Roads at Rural Stop-Controlled Intersections using Deep Learning" mainly uses neural networks to improve safety performance functions based on linear regression. We believe that this is a typical application of deep learning in the field of traffic volume prediction, so we cited it in our paper. Specifically, the following reference and a brief introduction of it were added to lines 116-118 in Section 2 on Page 3 in the revised version of the paper.
‘[13] Tawfeek, M.H.; El-Basyouny, K. Estimating Traffic Volume on Minor Roads at Rural Stop-Controlled Intersections using Deep Learning. Transportation Research Record 2019, 2673, 108–116. https://doi.org/10.1177/0361198119837236’
Comments 2: The contributions between lines 88 and 98 should be rewritten to reflect real contributions. For example, I do not consider the third point a contribution. The model works on 4 datasets from the same area. The real contribution is the model's accuracy or the model structure. Also, I feel points 1 and 2 fall within the model-structure contribution.
Response 2: Thank you very much for the comment. One of our main tasks in this paper was to verify the effectiveness of our model, so we conducted many experiments on real traffic datasets and compared with many baseline models. Therefore, the experiments are considered a contribution of this paper.
To demonstrate the performance of our model in real applications, experiments on the Beijing Taxi dataset (BJTaxi), which is different from the PEMS dataset, were added to the experiments of our paper in Sections 5.1 and 5.2 on Pages 11 and 12 in the revised version of the paper.
In addition, in the old version of the paper, we summarized 4 contributions, two of which (points 3 and 4) are about experiments. In the revised version of the paper, we combine these two points into one. Specifically, point 3 at lines 94-96 in Section 1 on Page 3 in the revised version of the paper is revised as follows.
‘Experiments on five real datasets show that the proposed model outperforms the baseline models in traffic flow prediction accuracy. In addition, experiments also show that the proposed model has strong performance and extensibility in long-term traffic flow prediction.’
Comments 3: Line 371: what is meant by the first 60% of the data and the last 40%? How was the data sorted? This point should be clarified.
Response 3: We appreciate the comment. To be consistent with other studies, the traffic data are sorted by timestamp from earliest to latest. Then, the first 60% (earlier) of the data is used as training data and the last 40% (later) is used as validation and test data. To clarify this, lines 374-376 in Section 5.1 on Page 11 in the revised version of the paper are revised as follows.
‘In the experiment, the data are sorted according to the timestamp in ascending order, and the first 60 percent of the traffic data is utilized as the training set and the last 40 percent is equally divided into the validation set and the test set.’
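For concreteness, the chronological split can be sketched as follows. This is a minimal illustration assuming the data is a time-ordered NumPy array; the function name and tensor shape are illustrative, not taken from the paper.

```python
import numpy as np

def chronological_split(data, train_frac=0.6, val_frac=0.2):
    """Split time-ordered data into train/validation/test sets without shuffling.

    Assumes `data` is already sorted by timestamp in ascending order,
    e.g., with shape (num_timesteps, num_nodes, num_features).
    """
    n = data.shape[0]
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return data[:train_end], data[train_end:val_end], data[val_end:]

# The first (earliest) 60% trains the model; the remaining (later) 40%
# is divided equally into validation and test sets (20% each).
data = np.zeros((10000, 170, 3))  # hypothetical PEMS-style tensor
train_set, val_set, test_set = chronological_split(data)
```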
Comments 4: Line 386 and before/after: although the authors tried to justify the selection of the hyperparameters, this makes the model very specific to this dataset, and it may not work on a different dataset or in a different context (e.g., urban only).
Response 4: Thank you very much for the comment. To evaluate the performance of our model on different types of datasets, experiments on the Beijing Taxi dataset (BJTaxi), which contains traffic flow data from 290 taxis on Beijing city roads (not highways), were added to Tables 1-3 in Sections 5.1 and 5.2 on Pages 11 and 12 in the revised version of the paper. In the experiment, we did not fine-tune the hyperparameters on the BJTaxi dataset; its hyperparameters are the same as those on the PEMS03 and PEMS08 datasets.
The experimental results of our model and the baseline models on BJTaxi were added to Table 3. The proposed model achieves the best performance, which indicates that the proposed model can be used to predict other types of traffic flow data.
Comments 5: Will the traffic characteristics, i.e., the posted speed of the highways, have an impact on the model?
Response 5: We appreciate the comment. The model proposed in this paper is mainly used to predict traffic flow data, so we do not consider traffic speed characteristics in our model. In the future, we can try to transfer our model to predict different types of traffic data.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
I think this is a well-written and interesting paper on combining static and dynamic prediction of traffic flow. I have a few comments, though, that need to be considered.
Lines 105-106: There is something wrong with the first two sentences in this section. First you say "a lot of research has been done on traffic flow prediction from different perspectives". Then you say "One of the typical problems of spatiotemporal data prediction is traffic flow prediction". One would expect the second sentence to specialize into one of those many perspectives of traffic flow prediction. Or maybe the sentences are supposed to switch order. Or the first sentence is just superfluous.
Lines 173-175: In the problem description you introduce the concept of "spatial correlation weight", but you do not formally explain what it means. The sentence "The higher the spatial correlation between two nodes, the greater their weight value." does not help, since "spatial correlation" has only been used informally up till here. I suppose it is the correlation coefficient between the traffic flows through the nodes, taken at the same times? I suggest clarifying this in the sentence.
Line 240: Here you introduce tau in the equations, but it is not explained until line 326.
Lines 378 and forward: I find it problematic to fine-tune the hyperparameters separately for each data set. I can understand that different variability in the data puts different demands on the architecture. However, it makes it more difficult for others to use the method if you cannot recommend a standard setting of these parameters. It also opens up suspicions of overtraining, since you may have let the prediction results guide the selection of hyperparameters (intentionally or not). Maybe most important, when you compared with the other methods, did you allow them to fine-tune parameters separately on each data set, or did you use the same standard settings for all data sets? In the latter case, this makes comparison with the other methods unfair.
Lines 478: "the traffic flow prediction performance of STCDN is worse than that of the proposed model in most metrics." I think it would be fair to acknowledge that STCDN is very close in performance to the proposed method. "which indicates that STCDN overfits the jitter in traffic flow," I think this explanation is ad hoc and unsupported. Generally, the RMSE metric should be more robust against "jitter", and it is the MAE that may allow for occasional larger deviations in predictions.
Line 576: "Section V-D" should probably be "Section 5.4"
Lines 583 and forward: I do not understand the argument about Table 7, and suspect that the rationale for the calculation is flawed. In the first line of the table you present the differences in performance when removing GAT or GCN. The difference when removing GAT is smaller than that of removing GCN. The same is true when training on peak hours only and on off-peak hours only. But then you argue that because the ratio between the differences between the "peak-only" and "complete" lines is larger for GAT than GCN (and the same for "off-peak" vs "complete"), the improvement caused by GAT is higher. But this is only because the improvement of GAT in the complete data is so small, which exaggerates the ratio. The fact is that in absolute numbers GCN contributes the largest improvements for all three versions of data.
Line 599: In figure 6 you show the entropy of GAT in peak and off-peak hours separately, but for GCN for the complete data only. This makes it difficult to compare. I suggest that you show both GAT and GCN for all of peak, off-peak, and complete data. Especially since (a) and (b) show a clear difference in entropy, indicating that the selection of data part matters.
Author Response
Comments 1: I think this is a well-written and interesting paper on combining static and dynamic prediction of traffic flow. I have a few comments, though, that need to be considered.
Lines 105-106: There is something wrong with the first two sentences in this section. First you say "a lot of research has been done on traffic flow prediction from different perspectives". Then you say "One of the typical problems of spatiotemporal data prediction is traffic flow prediction". One would expect the second sentence to specialize into one of those many perspectives of traffic flow prediction. Or maybe the sentences are supposed to switch order. Or the first sentence is just superfluous.
Response 1: Thanks a lot for this comment. We have switched the order of these two sentences on lines 103-105 in Section 2 on Page 3 in the revised version of the paper.
Comments 2: Lines 173-175: In the problem description you introduce the concept of "spatial correlation weight", but you do not formally explain what it means. The sentence "The higher the spatial correlation between two nodes, the greater their weight value." does not help, since "spatial correlation" has only been used informally up till here. I suppose it is the correlation coefficient between the traffic flows through the nodes, taken at the same times? I suggest clarifying this in the sentence.
Response 2: Thank you very much for the comment. The spatial correlation between two road segments is usually represented by the Euclidean distance between them. We have added an explanation on lines 174-176 in Section 3 on Page 5 in the revised version of the paper.
Comments 3: Line 240: Here you introduce tau in the equations, but it is not explained until line 326.
Response 3: Thanks a lot for the comment. We have added an explanation of tau on lines 241-242 in Section 4.1 on Page 7 in the revised version of the paper.
Comments 4: Lines 378 and forward: I find it problematic to fine-tune the hyperparameters separately for each data set. I can understand that different variability in the data puts different demands on the architecture. However, it makes it more difficult for others to use the method if you cannot recommend a standard setting of these parameters. It also opens up suspicions of overtraining, since you may have let the prediction results guide the selection of hyperparameters (intentionally or not). Maybe most important, when you compared with the other methods, did you allow them to fine-tune parameters separately on each data set, or did you use the same standard settings for all data sets? In the latter case, this makes comparison with the other methods unfair.
Response 4: We appreciate the comment. In our opinion, it is common to adjust hyperparameters to achieve the best performance of the model in experiments, and other models have also adjusted their hyperparameters. For example, in STSGCN (DOI: https://doi.org/10.1609/aaai.v34i01.5438) the authors mention that ‘The hyperparameters are determined by the model's performance on the validation datasets.’ This means that the authors adjusted the hyperparameters based on the model's performance on the validation datasets. A similar statement is made in ASTGCN (DOI: https://doi.org/10.1609/aaai.v33i01.3301922): ‘We test the number of the terms of Chebyshev polynomial K ∈{1,2,3}. Considering the computing efficiency and the degree of improvement of the forecasting performance, we set K = 3.’
In the experiments, the hyperparameter settings of the baseline models are completely consistent with their original papers, which ensures the fairness of the experiment and that each model achieves its best experimental results.
In addition, we have explained the guidelines for hyperparameter settings in the 3rd paragraph of Section 5.1 on Page 11 in the revised version of the paper. It is reasonable to adjust hyperparameters according to the data size to enable better convergence of the model.
Comments 5: Lines 478: "the traffic flow prediction performance of STCDN is worse than that of the proposed model in most metrics." I think it would be fair to acknowledge that STCDN is very close in performance to the proposed method. "which indicates that STCDN overfits the jitter in traffic flow," I think this explanation is ad hoc and unsupported. Generally, the RMSE metric should be more robust against "jitter", and it is the MAE that may allow for occasional larger deviations in predictions.
Response 5: Thank you very much for the comment. Compared with MAE and MAPE, RMSE is more sensitive to large errors. The experimental results show that the proposed model performs better than STCDN on most of the data, while it has relatively large prediction errors on a small amount of data. However, compared to the MAE and MAPE gaps between the proposed model and STCDN, the RMSE gaps between them are smaller. We have added a new explanation in the second-to-last paragraph of Section 5.2 on lines 485-488 on Page 14 in the revised version of the paper.
Comments 6: Line 576: "Section V-D" should probably be "Section 5.4".
Response 6: Thanks a lot for the comment. We have fixed this mistake in line 502 on Page 14 in the revised version of the paper.
Comments 7: Lines 583 and forward: I do not understand the argument about Table 7, and suspect that the rationale for the calculation is flawed. In the first line of the table you present the differences in performance when removing GAT or GCN. The difference when removing GAT is smaller than that of removing GCN. The same is true when training on peak hours only and on off-peak hours only. But then you argue that because the ratio between the differences between the "peak-only" and "complete" lines is larger for GAT than GCN (and the same for "off-peak" vs "complete"), the improvement caused by GAT is higher. But this is only because the improvement of GAT in the complete data is so small, which exaggerates the ratio. The fact is that in absolute numbers GCN contributes the largest improvements for all three versions of data.
Response 7: Thank you very much for the comment. The absolute values of the GCN improvements are larger than those of GAT. However, it can be seen from Table 7 that the absolute difference between the GAT and GCN improvements is smallest during peak hours, which indicates that GAT is more sensitive to peak-hour data than to off-peak and complete data. In our model, GAT and GCN complement each other: neither can achieve the best results by itself, and only by combining the two does the proposed model achieve the best performance.
We modified the explanations in the first and last paragraphs of Section 5.6, in lines 590-596 on Page 17 and lines 606-617 on Page 18 in the revised version of the paper, to illustrate how GAT and GCN complement each other in terms of entropy.
Comments 8: Line 599: In figure 6 you show the entropy of GAT in peak and off-peak hours separately, but for GCN for the complete data only. This makes it difficult to compare. I suggest that you show both GAT and GCN for all of peak, off-peak, and complete data. Especially since (a) and (b) show a clear difference in entropy, indicating that the selection of data part matters.
Response 8: We appreciate the comment. Referring to the GCN equation (Equation (4) in Section 4.2.1 on Page 8 in the revised version of the paper), the propagation matrix D^(-1/2)AD^(-1/2) is fixed, so the entropy of GCN is the same whether it is peak hours or off-peak hours.
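To make this concrete: the GCN propagation matrix depends only on the static road-network adjacency, so any entropy computed from its rows is identical for peak hours, off-peak hours, and the complete data. The sketch below is our illustration of this point (assuming entropy is computed over each node's normalized neighbor weights), not the exact computation in the paper.

```python
import numpy as np

def gcn_propagation_matrix(adj):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2); fixed because it
    depends only on the graph structure, not on the traffic input."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_tilde.sum(axis=1)))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

def row_entropy(weights, eps=1e-12):
    """Shannon entropy of each node's row-normalized weights."""
    p = weights / (weights.sum(axis=1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=1)

adj = np.array([[0., 1., 1.],
                [1., 0., 1.],
                [1., 1., 0.]])  # toy 3-node road graph
# Identical for peak and off-peak hours, since `adj` never changes;
# GAT's attention weights, in contrast, depend on the input traffic.
print(row_entropy(gcn_propagation_matrix(adj)))
```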
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The paper has significantly improved.
Why did the authors not use the BJTaxi dataset for long-term prediction (Table 4)?
Author Response
Comments 1: Why did the authors not use the BJTaxi dataset for long-term prediction (Table 4)?
Response 1: Thanks a lot for the comment. We conducted the long-term prediction experiments on the BJTaxi dataset. The experimental results of BJTaxi were added to Table 4 in Section 5.3 on Page 14 in the revised version of the paper. It should be noted that the PeMS dataset records data every 5 minutes, while BJTaxi records data every 30 minutes. Therefore, one ‘time interval’ represents 5 minutes in PeMS and 30 minutes in BJTaxi. To unify the representation of time, the header of Table 4 was also changed to time intervals.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for your replies to my concerns. I am happy with most of the clarifications, but there are some issues remaining:
Regarding response 2, on the spatial correlation weights: You may consider this a minor detail, and maybe it is, but the clarification you added just increases the confusion, so let's clear it out.

First, calling it "spatial correlation weights" may be confusing for the reader, since everywhere else you refer to "spatial correlations" as the correlation of traffic flows between spatially separated road segments (as opposed to temporal correlations). These appear to be learnable parameters in the models, and even dynamic parameters in the case of GAT. Maybe you should, in contrast, call this fixed matrix, which only depends on the Euclidean distances, something else to reduce confusion.

Second, the sentence starting on line 174 ("The higher the...") and the added sentence starting on line 175 ("In practice,") are contradictory, since the weight should probably be higher the closer the road segments are, i.e., a smaller distance has a higher weight. So it isn't the Euclidean distance directly. What is it then? The negative of the distance, the reciprocal of the distance, or some other expression? And is it the distance between the midpoints of the road segments, or the closest ends of the segments, or the closest distance between any pair of points on the segments? Is there any normalization of its values, e.g., to the range [0,1]? I think you can afford to add the formal definition of this matrix as an equation, and instead get rid of both of the two sentences (starting on lines 174 and 175) because they are not informative.

Finally, when now quickly searching for where this matrix is used, I can only find it referred to again at line 344, where you count the number of non-zero elements to find the number of edges. This makes sense from the context: it is the connectivity matrix indicating which nodes are connected in the graph. Since you do not seem to use the weights (or did I miss something), maybe it can actually be just zeroes or ones, depending on whether two road segments are connected or not?
Regarding response 4, on hyperparameters: There is some confusion in the literature on what is called validation versus test set. The most common version, also found on Wikipedia ("Training, validation, and test data sets"), is that the validation set is used for checking training convergence for early stopping and for selecting hyperparameters of the models, whereas the test set is completely isolated from the parameter selection and training phase and only used for assessing the final performance of the model. (However, I have also seen examples where the terms are used the other way around.) According to your line 378 it appears that you have used the validation set to assess the performance of the trained model (and presumably the test set is then used for checking when training has converged? Or else what do you use the test set for?).

Using the same data set that is used as a stopping criterion during training (in your case the test set) for also selecting hyperparameters is perfectly fine. Using the same data set as is used for final assessment of performance (in your case the validation set) is not ok. Setting the hyperparameters to maximize the final performance metric of the model is a certain recipe for overtraining. I have not checked how the references you give in your reply use the terms "validation" and "test". However, that someone else did something wrong is not a valid defense.

Since in your case the test and validation data sets are equally large, and if your method is as reliable and robust as you believe, then you should end up with (almost) the same hyperparameters if you use the test set to find the optimal parameters instead. If you are lucky you can just update the description of how they were found. On the other hand, if the hyperparameters vary wildly, this is an indication that the method is not very robust, and the provided results are just by fluke.
Additional minor things:
Line 178: dangling reference.
Line 380: This is for PEMS. For clarity, provide the corresponding numbers for BJTaxi.
Otherwise I think this is a nice paper, and hope that you will be able to sort out the above issues.
Author Response
Comments 1: Regarding response 2, on the spatial correlation weights: You may consider this a minor detail, and maybe it is, but the clarification you added just increases the confusion, so let's clear it out.
First, calling it "spatial correlation weights" may be confusing for the reader, since everywhere else you refer to "spatial correlations" as the correlation of traffic flows between spatially separated road segments (as opposed to temporal correlations). These appear to be learnable parameters in the models, and even dynamic parameters in the case of GAT. Maybe you should, in contrast, call this fixed matrix, which only depends on the Euclidean distances, something else to reduce confusion.
Second, the sentence starting on line 174 ("The higher the...") and the added sentence starting on line 175 ("In practice,") are contradictory, since the weight should probably be higher the closer the road segments are, i.e., a smaller distance has a higher weight. So it isn't the Euclidean distance directly. What is it then? The negative of the distance, the reciprocal of the distance, or some other expression? And is it the distance between the midpoints of the road segments, or the closest ends of the segments, or the closest distance between any pair of points on the segments? Is there any normalization of its values, e.g., to the range [0,1]? I think you can afford to add the formal definition of this matrix as an equation, and instead get rid of both of the two sentences (starting on lines 174 and 175) because they are not informative.
Response 1: We appreciate the comment. The ‘spatial correlation weights’ in lines 173 and 174 on Page 5 and in line 343 on Page 10 have been renamed ‘road network weights’ in the revised version of the paper to avoid confusion.
We calculated the Euclidean distance between connected road segments and then performed min-max normalization on these distances. It should be noted that "connected road segments" are referred to as "adjacent road segments" in the revised version of the paper. The road network weight between two road segments is the reciprocal of the normalized distance. It is worth noting that a roadside facility is installed on each road segment to collect traffic data, and the Euclidean distance between two road segments here actually refers to the distance between the two roadside facilities.
We have added an explanation on lines 173-177 in Section 3 on Page 5 in the revised version of the paper.
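For clarity, the weight computation described above can be sketched as follows. The function name and the epsilon guard are illustrative additions (min-max normalization maps the smallest distance to zero, so a plain reciprocal would divide by zero).

```python
import numpy as np

def road_network_weights(dist, eps=1e-6):
    """Road network weights between adjacent segments from Euclidean distances.

    `dist[i, j]` holds the distance between the roadside facilities of
    adjacent segments i and j, and 0 where segments are not adjacent.
    Distances over the existing edges are min-max normalized, and each
    weight is the reciprocal of the normalized distance, so closer
    segments receive larger weights.
    """
    weights = np.zeros_like(dist)
    edges = dist > 0
    d = dist[edges]
    d_norm = (d - d.min()) / (d.max() - d.min() + eps)
    weights[edges] = 1.0 / (d_norm + eps)  # eps avoids division by zero
    return weights
```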
Comments 2: Finally, when now quickly searching for where this matrix is used, I can only find it referred to again at line 344, where you count the number of non-zero elements to find the number of edges. This makes sense from the context, that it is the connectivity matrix indicating which nodes are connected in the graph. Since you do not seem to use the weights (or did I miss something), maybe it can actually be just zeroes or ones, depending on whether two road segments are connected or not?
Response 2: Thank you very much for the comment. We mentioned the spatial relationship matrix (i.e., the adjacency matrix A in the revised version of the paper) three times in the original paper.
- In Equation 4 on Page 8 in the revised version of the paper, the adjacency matrix A is used to calculate the static spatial traffic flow characteristics.
- In Equation 6 on Page 8, the attention coefficient is calculated only among adjacent nodes (i.e., adjacent road segments). Therefore, it is necessary to find adjacent nodes based on the adjacency matrix A, i.e., nodes with a road network weight greater than 0.
- In line 343 in Section 4.4 on Page 10, during complexity analysis, it is necessary to determine the number of edges based on the adjacency matrix A.
Comments 3: Regarding response 4, on hyperparameters: There is some confusion in the literature on what is called validation versus test set. The most common version, also found on Wikipedia ("Training, validation, and test data sets"), is that the validation set is used for checking training convergence for early stopping and for selecting hyperparameters of the models, whereas the test set is completely isolated from the parameter selection and training phase and only used for assessing the final performance of the model. (However, I have also seen examples where the terms are used the other way around.) According to your line 378 it appears that you have used the validation set to assess the performance of the trained model (and presumably the test set is then used for checking when training has converged? Or else what do you use the test set for?).
Using the same data set that is used as a stopping criterion during training (in your case the test set) for also selecting hyperparameters is perfectly fine. Using the same data set as is used for final assessment of performance (in your case the validation set) is not ok. Setting the hyperparameters to maximize the final performance metric of the model is a certain recipe for overtraining. I have not checked how the references you give in your reply use the terms "validation" and "test". However, that someone else did something wrong is not a valid defense.
Since in your case the test and validation data sets are equally large, and if your method is as reliable and robust as you believe, then you should end up with (almost) the same hyperparameters if you use the test set to find the optimal parameters instead. If you are lucky you can just update the description of how they were found. On the other hand, if the hyperparameters vary wildly, this is an indication that the method is not very robust, and the provided results are just by fluke.
Response 3: We appreciate the comment. In the experiments, the data in the dataset are sorted by timestamp in ascending order and divided into three sets: the training set, the validation set, and the test set, with proportions of 60%, 20%, and 20%, respectively. To ensure the fairness of the experiment, the dataset division and proportions used in this paper are the same as those in most baseline models. Therefore, we did not change the proportions of the validation set and the test set, and their proportions are the same.
We use the training set to train the model, and the validation set to evaluate the model's performance in each training epoch and check whether the model converges. The model parameters that perform best on the validation set are recorded. After training is completed, we evaluate the prediction accuracy (i.e., the three error metrics MAE, MAPE, and RMSE) of our model and the baseline models on the test set, as shown in Table 3. It should be emphasized that the model has never seen the test set data during the training process. We adjust the hyperparameters only based on the validation set, not on the test set.
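For reference, the three error metrics are computed in the standard way; the sketch below reflects the usual definitions (the epsilon in MAPE, which guards against zero flow values, is an illustrative implementation detail).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred, eps=1e-6):
    """Mean absolute percentage error; eps guards against zero flows."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / (y_true + eps)))

def rmse(y_true, y_pred):
    """Root mean squared error; squaring penalizes large errors most,
    making RMSE the most sensitive of the three to large deviations."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```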
Three of the seven hyperparameters in Table 2 on Page 11 differ across datasets, and their range of variation is not large. In addition, a smaller learning rate is intended to enable better convergence of models with more parameters, and it has no direct correlation with the proposed model itself.
To avoid confusion, we modified the description in lines 375-378 in Section 5.1 on Page 11 in the revised version of the paper.
Comments 4: Line 178, dangling reference.
Response 4: Thank you very much for the comment. We have fixed this error on line 178 in Section 3 on Page 5 in the revised version of the paper.
Comments 5: Line 380: This is for PEMS. For clarity, provide the corresponding numbers for BJTaxi.
Response 5: Thanks a lot for the comment. On the BJTaxi dataset, we also predict the subsequent 12 time intervals based on the previous 12 time intervals. However, it should be noted that the PeMS dataset records data every 5 minutes, while BJTaxi records data every 30 minutes. Therefore, one ‘time interval’ represents 5 minutes in PeMS and 30 minutes in BJTaxi. We explained this time difference on line 381 in Section 5.1 on Page 11 in the revised version of the paper.
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
Thanks for your clarifying comments. I am now happy with the result.
I have just two minor adjustment suggestions in regard to the discussion on validation/test sets: I am happy that we cleared out that you are indeed following the correct machine learning protocol. However, to make it even clearer I suggest extending the sentence on lines 375-377 a little:

"The validation set is used to evaluate the model's performance in each training epoch *to* check whether the model converges, *and for tuning of hyperparameters*."

Then you tell about the usage of the test set on line 377, which is also good.

However, the new sentence on line 378, "The model has never seen the test set data during the training process.", is now superfluous and can be removed. Once we cleared out the misunderstanding and you made it clear with the previous two sentences in the text, this is implicitly understood, and need not be stated explicitly.
Author Response
Comments 1: Thanks for your clarifying comments. I am now happy with the result.
I have just two minor adjustment suggestions in regard to the discussion on validation/test-sets: I am happy that we cleared out that you are indeed following the correct machine learning protocol. However, to make it even clearer I suggest extending the sentence on lines 375-377 a little:
"The validation set is used to evaluate the model’s performance in each training epoch *to* check whether the model converges, *and for tuning of hyper parameters*."
Then you tell about the usage of the test set on line 377, which is also good.
However, the new sentence on line 378, "The model has never seen the test set data during the training process." is now superfluous and can be removed. Once we cleared out the misunderstanding and you made it clear with the previous two sentences in the text, this is implicitly understood, and need not be stated explicitly.
Response: We appreciate the comment, which makes the paper clearer and more rigorous. We have revised the description of the role of the validation set on lines 375-377 in Section 5.1 on Page 11. Additionally, we have removed the sentence on line 378, “The model has never seen the test set data during the training process.” Finally, we would like to thank you again for your rigor.