Attributation Analysis of Reinforcement Learning-Based Highway Driver
Round 1
Reviewer 1 Report
Title of the Paper: Attributation analysis of Reinforcement Learning based highway driver
0) Please describe briefly with your own words what this paper is about:
This paper describes one method of explainable AI (xAI) that is adequate to describe a Deep Reinforcement Learning (RL) solution to an autonomous driving problem. The writers acknowledge the goals of xAI in RL, explain why the methods used have to be different from those for other Machine Learning (ML) problems, and describe the idiosyncrasies of Deep RL. They pose the concept of human expectations and validation of xAI results and, through that, obtain a comparison tool that also debugs the ML models and checks the quality of the data. The researchers present the mathematical preliminaries their work is based on, their results, the problems they’ve encountered, and future work directions.
1) Originality: Does the paper contain significant content to justify publication? What are novel aspects? Did you check for plagiarism, e.g. with a quick Google search?
The paper contains significant content to justify publication. The researchers study an issue that is quite new in current research; it is still evolving and will continue to do so. It is not too novel, though, since this is quite an easy first idea; the question is whether they will extend it to something more valuable and how they will approach future challenges in the autonomous driving domain. There are still new ideas to come in this area.
2) Background and Related Work: Is there enough background and relevant related work? Are any relevant references missing? Where would you like to see more background? Please provide recommendations.
The writers could benefit from more background and related work. For the use of Mutual Information for the detection of non-linear correlations:
- MacKay, David J. C. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
One important paper about changing the architecture of a Neural Network with the use of xAI:
- Yeom, Seul-Ki, et al. "Pruning by explaining: A novel criterion for deep neural network pruning." Pattern Recognition 115 (2021): 107899.
Bringing human needs into xAI research, particularly in RL, also calls for the following references:
- Taylor, Matthew Edmund, and Zhaodong Wang. "Interactive reinforcement learning with dynamic reuse of prior knowledge." U.S. Patent No. 11,308,401. 19 Apr. 2022.
- Angerschmid, Alessa, et al. "Fairness and Explanation in AI-Informed Decision Making." Machine Learning and Knowledge Extraction 4.2 (2022): 556-579.
3) Methodology: Is the paper's argument built on an appropriate base of theory and concepts? Are the methods used appropriately described? Would you like to have more explanations?
Yes, overall the paper is based on solid theoretical ground. For the non-linear correlations, please try Mutual Information as well. How does your method uncover or take into consideration the temporal dependence of consecutive states? This research raises the concept of expected behaviours / expected decision-making. How many human users did you have to ask? Are you planning a user study where you track eye movements or collect human feedback other than rules? In section 4.1 it is not clear what the best model checkpoints are. I suppose that the NN used is a feed-forward one, but there is no information about its concrete architecture and no GitHub repository. This is a major negative point of this paper; please provide all the information as well as further implementation details, such as how big the dataset was, how much time the network needed for training, and so on. At what step in Figure 1 does the model “start collapsing”, and where is the highest mean sum of rewards? In 4.2.1, why are there 6 groups? Is it a heuristic from domain knowledge? There are many assumptions and constants there; how did they come about? I would like more description in Figure 2. What are the values of Rho (analogous to Pearson) for the groups?
4) Results: Are the results presented clearly and appropriately? Do the conclusions adequately tie together the other elements of the paper?
It would be great if the authors presented driving scenarios in which xAI is used during the training of the Deep NN and provides results that alter the course of this training, showing for example that the Deep NN does not have an appropriate reward scheme or that the maximum number of episodes is too small. To fix the corrupted model or the problem of vanishing/exploding gradients, please use some other xAI method (see proposed references). You will need more users in the future to test even more cases with eye-tracking methods and more rules. Try to write what changes you will undertake to explain why the values of attributation should be higher (section 5). Is the rule in lines 251 and 252 reasonable? What will you do in future work to find the factors mentioned in lines 249 and 250? One of the most important things that the researchers need to sketch, even if they don’t implement it in the current paper, is how they think they will correct the Deep NN according to human expectations. Furthermore, they need to think about what they will do if the human knowledge is not perfect (faulty). Therefore, they need more human experts. In the Discussion section, you mention it briefly; yes, the drivers might learn new driving styles! Make a plan about what you can do against vanishing gradients: what xAI method can show you that you have this problem? The honesty about the problem description is highly appreciated; yes, a wrong implementation leads to a wrong data distribution, so what can be done about it? Please think of use cases and scenarios in driving where the rules are the opposite of that and think about what the explanations have to look like.
5) Qualitative Evaluation: Is the paper well written? Is it clear, readable and comprehensive? Sentence structure, acronym explanation, typos, etc. ok?
Yes, the paper is well written, clear and comprehensive. If you can write more pages, please provide some acronyms. No typos were found.
6) Quantitative Evaluation: Given that the worst paper you have ever read receives 0 and the best paper ever receives 100 points – how many points would you assign to this paper: 60
FINAL RECOMMENDATION
C=Major Revision
I would only like to see the aforementioned problems addressed. I will be happy to see your future work!
Author Response
Dear Reviewer,
We greatly appreciate the thorough and thoughtful comments provided on our submitted article. We made sure that each comment has been addressed carefully, either by additions to the manuscript or in the detailed responses below. We attached the improved manuscript and a document marking the differences between the original and the new version.
Thank you for bringing our attention to the proposed literature which we included in the article. The references are marked as [1], [28], [14].
1. When it comes to the methodology, we considered Mutual Information as you suggested. We calculated the highest values of Mutual Information for the action of the ACC agent with respect to the values of the ego's features and attributation. Table 4 shows the results of the calculations.
2. Regarding your question about the number of experts who took part in the project: both authors work in a research team at the automotive company Aptiv. The research team involved in the project of planning behavior with Reinforcement Learning algorithms consists of 5 experts. For the last 4 years, we have been developing the RL module, carefully designing the project requirements that our module should comply with. The analysis of the agent's behavior was conducted based on the knowledge gathered so far. The same applies to your comment about the imperfection of human knowledge. We fully agree that the application of our method requires strong expertise in the problem that the agent faces. We are also aware of cases in which RL agents outperform human knowledge; however, in the current work we have not noticed such behavior.
3. Regarding the idea of using an eye tracker on the driver, we must admit that we had thought about it during our research, especially when a neighboring team was involved in the Simusafe project (simusafe.eu) and used such facilities to analyze human behavior in various traffic situations. However, their experience led us to abandon this approach for now and focus more on polling methods, while keeping the approach in mind for the future.
4. When it comes to the neural network architecture, dataset, and training process, we have provided all details in section 4. We described the architecture of the neural network with an additional scheme in figure 1. We provided descriptions of the agent observation space and detailed all terms of the reward function (equations 2 and 3). We added a description of the training process and gathered all training hyperparameters in table 1. As you rightly noticed, the term “best checkpoint” is rather ambiguous, which is why we explain it further at the end of section 4.1. We also explained what “starts collapsing” means and at which step we found the best checkpoint for both agents.
5. In section 4.2.1 we consider 6 groups because there are 6 different maneuvers to be selected by the agent. The values for all the groups are stored in tables 2 and 3 for both coefficients.
6. Referring to the comments on results please see the answers below.
Solving the problem of vanishing gradients is not the main purpose of the presented method. We added further explanations to the article. During experiments, we found that the analysis of attributation can detect this problem in individual regions of the model architecture. Such a problem occurs when the attributation of a given feature is close to zero for all samples in the dataset. We agree that there exist better methods for detecting and fixing vanishing gradient problems than the presented method.
7. As for the remark on increasing the number of experts and using eye-tracking methods, we absolutely agree, and we added the following lines to section 5, “Results”:
“Expected attributation values and correlations are based on a review prepared by a team of 5 experts working on the use of AI/ML in autonomous vehicles. In the future, this type of analysis should be conducted based on extensive research of human eye movements in various driving scenarios.”
8. Considering the remarks on the discovered correlation in the results section (lines 249 – 252), we explained it further in the text:
“Another interesting correlation we found is the fact the agent seems to focus slightly more on other vehicles' positions when it is braking. This is also desired and justified effect because the braking intensity should depend on the intensity of changes in the object's position in relation to ego. When there is no need to brake, it is more important to know the speed of the objects, which determines the stability of the situation.”
9. Referring to the question of how we are going to improve our network, we added some general ideas to section 6, “Application”:
“Furthermore, the results of the presented method may be utilized for improvement of the ANN architecture or to enhance the training process. Enhancement of the learning process may start by tuning the reward function to better represent the driver's objective. For example, if the results point out that the agent does not pay attention to the other objects then we propose to add to the reward some term that depends on the objects (e.x. reward based on time to collision metric).
On the other hand, ANN architecture may be further enhanced by redesigning the modules that process features that are neglected. By appealing to the issue of disregarding objects, we may propose to redesign that part for example by using proven architecture such as presented in [...]”
10. Referring to the comment: Furthermore, they need to think about what they will do if the human knowledge is not perfect (faulty). Therefore, they need more human experts. In the Discussion section, you mention it briefly; yes the drivers might learn new driving styles!
We must say that we are aware of that problem, and we need to reconsider it in future work.
11. As for the question about our approach to vanishing gradients, we do not want to focus further on that problem because the method is not intended to handle it. We only noticed that it could provide such useful information; however, we believe that there are better solutions for that matter.
12. We included in the article the answer to the question “yes wrong implementation leads to wrong data distribution – what to do about it?”. Since the data distribution problem turned out to originate from a wrongly implemented function that normalized observations before passing them to the neural network, we highly recommend correcting the implementation and repeating the experiment, which is what we did.
13. Answering comment: “Please think of use cases and scenarios in driving where the rules are the opposite of that and think about what the explanations have to look like.”
Thank you for such an insightful remark. Upon further reflection, we came to the conclusion that such atypical scenarios are isolated cases and should therefore be considered individually.
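To illustrate the Mutual Information calculation mentioned in response 1, here is a minimal sketch using a plain histogram estimator in NumPy. The function name `mutual_information`, the bin count, and the synthetic data are assumptions made for illustration only; this is not the computation behind Table 4. The example shows a purely non-linear dependence that Pearson's coefficient misses but Mutual Information picks up.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) in nats for two 1-D sample arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Synthetic stand-ins (hypothetical, not the authors' data)
rng = np.random.default_rng(0)
feature = rng.uniform(-1.0, 1.0, 20_000)
action_dependent = feature ** 2 + 0.05 * rng.normal(size=feature.size)
action_independent = rng.uniform(-1.0, 1.0, feature.size)

mi_dep = mutual_information(feature, action_dependent)
mi_ind = mutual_information(feature, action_independent)
pearson_dep = float(np.corrcoef(feature, action_dependent)[0, 1])
```

The histogram estimator is biased upward for small samples and depends on the bin count, so values should only be compared when computed with the same settings.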
Most of the statistical analysis was conducted in Minitab 11 software. The thresholds, good practices, and interpretation of the coefficients are based on its documentation, which is available here: https://support.minitab.com/en-us/minitab/21/ . To avoid temporal dependencies and ensure statistical independence of observations, we subsampled the attributation data: from a set of 240000 evaluation steps in simulation, we took every tenth step for statistical analysis.
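The thinning step described above can be sketched as follows. The step count and the stride of ten come from the response; the autocorrelated signal is a synthetic AR(1) stand-in (an assumption, not simulator output), used only to show how taking every tenth step reduces the correlation between consecutive samples.

```python
import numpy as np

N_EVAL_STEPS = 240_000   # number of evaluation steps, from the response
STRIDE = 10              # keep every tenth step, from the response

def lag1_autocorr(x):
    """Pearson correlation between consecutive samples."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Synthetic, temporally dependent stand-in for a logged attributation value
rng = np.random.default_rng(0)
noise = rng.normal(size=N_EVAL_STEPS)
signal = np.empty(N_EVAL_STEPS)
signal[0] = noise[0]
for t in range(1, N_EVAL_STEPS):
    signal[t] = 0.8 * signal[t - 1] + noise[t]   # AR(1), coefficient is hypothetical

thinned = signal[::STRIDE]                        # 24,000 samples remain

autocorr_full = lag1_autocorr(signal)
autocorr_thinned = lag1_autocorr(thinned)
```

Thinning weakens serial dependence but does not guarantee independence; how much remains depends on how slowly the underlying signal varies relative to the stride.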
Sincerely
Authors
Nikodem Pankiewicz,
Paweł Kowalczyk
Author Response File:  Author Response.pdf
Reviewer 2 Report
The article deals with a timely and interesting topic. It presents the research process with precise justifications and clear, concise diagrams. The results are also well presented.
Author Response
Dear Reviewer
Thank you for your favorable opinion. We are pleased that you enjoyed our article.
Sincerely
Authors
Nikodem Pankiewicz,
Paweł Kowalczyk
Author Response File:  Author Response.pdf
Reviewer 3 Report
The paper finds that making such an analysis allows for a better understanding of the agent’s decisions, inspecting its behavior, debugging the ANN model, and verifying the correctness of input values, which increases its credibility. The statistical methods applied to collected samples of agent decisions allow for recognition of the agent’s behavior patterns by looking globally at overall behavior and not at individual actions. This is achieved by analysis of the attribution distribution differentiated by considered maneuver and juxtaposed with values of other parameters describing the situation. By inspecting the analysis results, we can seek confirmation that the ANN concentrates on input features which are also important for a human driver. With an examination of the correlation between attribution and feature values, we find patterns that match human intuition as well as patterns that are contrary to it. This knowledge helps us improve the model by changing model architecture, enhancing the training process, and ensuring that decisions are made in accordance with environment evaluation that prioritizes safety and effectiveness.
Following are some minor comments:
1. In Figure 1, how is the reward obtained in the reinforcement learning? What criteria are there? You can write it down in detail.
2. More stages can be compared to improve the persuasiveness of data analysis. For example, you can exchange the action sequence of the Maneuver Agent.
3. The presentation should be further polished and improved.
Author Response
Dear Reviewer,
We greatly appreciate the thorough and thoughtful comments provided on our submitted article. We made sure that each comment has been addressed carefully either by addition in the manuscript or in the below detailed responses.
Referring to the 1st comment, we added a detailed description of the reward function for both agents (equations 2 and 3).
As for the 2nd comment, to avoid temporal dependencies and ensure statistical independence of observations, we subsampled the attributation data. From a set of 240000 evaluation steps in simulation, we took every tenth step for statistical analysis.
When it comes to improving the article, we reviewed the manuscript in detail and polished it as well as we could. Please review the new version and the document comparing both versions.
Sincerely
Authors
Nikodem Pankiewicz,
Paweł Kowalczyk
Author Response File:  Author Response.pdf
Round 2
Reviewer 1 Report
The researchers overall did make the necessary effort to change their paper according to my propositions. The references and directions that I presented were taken into account. There are still some minor issues that need to be addressed:
- I am still puzzled about the ANN architecture details in section 4 and figure 1. It is not clear to me if it is in the category of Convolutional Neural Networks or not – since I am seeing residual blocks – and the input feature vectors listed in 4.1. seem to describe entities that have numerical values but do not necessarily have a spatial (or at least grid-like, positional, sequential) relationship. Therefore, it is not straightforward why this architecture is adequate. If it is a fully-connected NN, then it does make sense. Please provide the exact details about your NN, like the number of layers, the number of neurons in each layer, the exact size of the input, and the exact size of the intermediate outputs of each layer. When you write that the architecture differs between agents by the last control module part, how is that in practice?
- Please provide an example of what a characteristic input looks like, as well as some statistics for the values. For example: what is the minimum, maximum, mean, and standard deviation of longitudinal acceleration (acc_s)?
- The details about the reward are extremely important. What is missing is the description of the reward’s characteristics during training – especially the information if it is negative or positive. It is quite important to specify if this is a reward that has an increasing characteristic as the goal is approached or if it allows for non-optimal actions to be selected in favour of a future higher sum of rewards. Please provide its evolution during some training scenarios where you can give nice examples of what is happening with the vehicle and also two test set scenarios.
- In figure 2 did you make some considerations and check if the model that is the best for the training test overfits – has a less good generalization performance on the test set? If not, please consider and mention this for future work.
- Please provide some correlation values in figure 4.
- Regarding the wrong implementation referred to in lines 326-331, you can try to find that out with xAI methods; similarly for “tuning the reward function”, as you rightly mentioned in line 349.
- Tables 3 and 4 are highly appreciated, please provide some details about if this “makes sense” - are those values expected? Are they in accordance with your domain knowledge (even if it is out of scope to ask several people)? Even more interesting: is there some value in there that surprises you? Please analyze them and compare them a bit.
Author Response
Dear Reviewer,
We greatly appreciate the further comments and insight on our article. We made sure that each comment has been addressed carefully, either by additions to the manuscript or in the detailed responses below. We attached the improved manuscript and a document marking the differences between the original and the new version.
I am still puzzled about the ANN architecture details in section 4 and figure 1. It is not clear to me if it is in the category of Convolutional Neural Networks or not – since I am seeing residual blocks – and the input feature vectors listed in 4.1. seem to describe entities that have numerical values but do not necessarily have a spatial (or at least grid-like, positional, sequential) relationship. Therefore, it is not straightforward why this architecture is adequate. If it is a fully-connected NN, then it does make sense. Please provide the exact details about your NN, like the number of layers, the number of neurons in each layer, the exact size of the input, and the exact size of the intermediate outputs of each layer. When you write that the architecture differs between agents by the last control module part, how is that in practice?
We used a Fully Connected Feed Forward Neural Network for the experiments; please see the updated figure 1 with a more detailed NN architecture. This figure presents all network layers, their sizes, activations, and additional operations such as concatenation or max pooling. We did not use Convolutional Neural Networks at all, although we believe that spatial relationships between the ego vehicle, road lanes, and objects can be captured by fully connected layers. All spatial information is provided in the Vehicle Coordinate System.
Figure 1 also presents two different control modules, one for the ACC agent and one for the Maneuver agent. For each agent, only the corresponding module was created.
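For readers without access to figure 1, the following is a rough sketch of a fully connected network that combines concatenation and max pooling, with a control head on top. Every size, the layer layout, and the choice to max-pool over per-object encodings are hypothetical placeholders; the actual layers are those given in figure 1 of the manuscript. Only the six outputs of the head follow the response's statement that the Maneuver agent selects among 6 maneuvers.

```python
import numpy as np

rng = np.random.default_rng(0)

# All dimensions below are hypothetical placeholders, not the authors' values.
EGO_DIM, OBJ_DIM, N_OBJECTS, HIDDEN, N_MANEUVERS = 4, 5, 8, 32, 6

def dense(n_in, n_out):
    """Randomly initialised weights and zero bias for one dense layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

W_ego, b_ego = dense(EGO_DIM, HIDDEN)        # encoder for ego features
W_obj, b_obj = dense(OBJ_DIM, HIDDEN)        # encoder shared by all objects
W_trunk, b_trunk = dense(2 * HIDDEN, HIDDEN)
W_head, b_head = dense(HIDDEN, N_MANEUVERS)  # control module (Maneuver-style head)

def maneuver_logits(ego, objects):
    ego_h = relu(ego @ W_ego + b_ego)                  # shape (HIDDEN,)
    obj_h = relu(objects @ W_obj + b_obj)              # shape (N_OBJECTS, HIDDEN)
    pooled = obj_h.max(axis=0)                         # max pooling over objects
    joined = np.concatenate([ego_h, pooled])           # concatenation
    trunk = relu(joined @ W_trunk + b_trunk)
    return trunk @ W_head + b_head                     # one logit per maneuver

ego = rng.normal(size=EGO_DIM)
objects = rng.normal(size=(N_OBJECTS, OBJ_DIM))
logits = maneuver_logits(ego, objects)
```

In this sketch, max pooling over objects makes the output independent of the order in which surrounding vehicles are listed, which is one common reason to use it with purely dense layers.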
Please provide an example of what a characteristic input looks like, as well as some statistics for the values. For example: what is the minimum, maximum, mean, and standard deviation of longitudinal acceleration (acc_s)?
Examples of descriptive statistics regarding ego parameters before and after normalizations are now provided in table 1.
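A minimal sketch of how such descriptive statistics could be computed before and after normalization. The z-score normalization and the synthetic `acc_s` values are assumptions for illustration; the response does not state which normalization was used, and the numbers here are unrelated to table 1.

```python
import numpy as np

def describe(x):
    """Minimum, maximum, mean and standard deviation of a 1-D array."""
    return {
        "min": float(np.min(x)),
        "max": float(np.max(x)),
        "mean": float(np.mean(x)),
        "std": float(np.std(x)),
    }

def standardize(x):
    """Z-score normalization (an assumed choice, not confirmed by the source)."""
    return (x - np.mean(x)) / np.std(x)

# Placeholder longitudinal acceleration samples (hypothetical data)
rng = np.random.default_rng(0)
acc_s = rng.normal(0.0, 1.5, 10_000)

before = describe(acc_s)
after = describe(standardize(acc_s))
```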
The details about the reward are extremely important. What is missing is the description of the reward’s characteristics during training – especially the information if it is negative or positive. It is quite important to specify if this is a reward that has an increasing characteristic as the goal is approached or if it allows for non-optimal actions to be selected in favour of a future higher sum of rewards. Please provide its evolution during some training scenarios where you can give nice examples of what is happening with the vehicle and also two test set scenarios.
Please consider the reward functions (equations 2 and 3). All terms are defined to return positive values, so the sign of each weight indicates whether the term is a positive (+) or negative (-) reinforcement. The RL agent is trained to maximize the sum of rewards, and it has to learn that it sometimes needs to sacrifice immediate reward to obtain a higher one within a short time horizon. For example, when the ACC agent accelerates, it receives a negative reward for acceleration and jerk; however, it ends up with a higher reward for Speed limit execution.
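The trade-off described above can be illustrated with a small sketch: non-negative terms combined with signed weights, and a discounted return compared across two short trajectories. The weight values, term magnitudes, discount factor, and trajectory lengths are all hypothetical; they are not taken from equations 2 and 3.

```python
# Hypothetical weights: the sign marks positive (+) or negative (-) reinforcement.
WEIGHTS = {"speed_limit": 1.0, "acceleration": -0.2, "jerk": -0.1}

def step_reward(terms):
    """Weighted sum of reward terms; every term must be non-negative."""
    if any(value < 0 for value in terms.values()):
        raise ValueError("reward terms are defined to be non-negative")
    return sum(WEIGHTS[name] * value for name, value in terms.items())

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma per step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Trajectory A: stay at half of the speed-limit reward, no acceleration penalty.
cruise = [step_reward({"speed_limit": 0.5, "acceleration": 0.0, "jerk": 0.0})] * 20

# Trajectory B: accept acceleration and jerk penalties for 5 steps,
# then collect the full speed-limit reward.
accelerate = (
    [step_reward({"speed_limit": 0.5, "acceleration": 1.0, "jerk": 1.0})] * 5
    + [step_reward({"speed_limit": 1.0, "acceleration": 0.0, "jerk": 0.0})] * 15
)

return_cruise = discounted_return(cruise)
return_accelerate = discounted_return(accelerate)
```

With these placeholder numbers, the first step of the accelerating trajectory is worth less than cruising, yet its discounted return over the whole window is higher.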
In figure 2 did you make some considerations and check if the model that is the best for the training test overfits – has a less good generalization performance on the test set? If not, please consider and mention this for future work.
Typically, in Reinforcement Learning methodology there is no distinction between training and testing. Generally, we use a stochastic simulation which generates randomized scenarios. Agent actions introduce disturbances to the scenario and change the course of events. For the testing part, we used the same environment definition, which means that some testing scenarios may be the same as those that the agents met in training. This is possible but, in our opinion, identical scenarios almost never occur, and due to the amount of data it is hard to verify whether such a coincidence happened at all. We also have several predefined scenarios which we used for testing the simplest cases; however, we expect that similar cases were also generated by the random process during training. The principle of training and testing in the same environment is widespread and universally applied in RL studies. Therefore, it is very hard to consider the overfitting problem in this case.
Please provide some correlation values in figure 4
Correlation values corresponding to examples of visualizations in this figure are available in tables 2 and 3 in rows referring to ego_vel_s. Proper references were added to improve readability.
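A minimal sketch of computing correlation coefficients between a feature and its attributation separately for each action group. It assumes the two coefficients are Pearson's r and Spearman's rho, which the responses do not state explicitly; the function names and the synthetic data are likewise illustrative and unrelated to tables 2 and 3.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Pearson correlation of ranks; this simple ranking assumes no ties."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

def correlations_by_action(feature, attribution, actions):
    """Map each action id to (Pearson r, Spearman rho) within that group."""
    table = {}
    for action in np.unique(actions):
        mask = actions == action
        table[int(action)] = (
            pearson(feature[mask], attribution[mask]),
            spearman(feature[mask], attribution[mask]),
        )
    return table

# Synthetic stand-ins: 6 action groups and a monotone, non-linear relation
rng = np.random.default_rng(0)
n_samples = 6_000
actions = rng.integers(0, 6, n_samples)
ego_vel_s = rng.uniform(0.0, 1.0, n_samples)
attribution = np.exp(5.0 * ego_vel_s)

table = correlations_by_action(ego_vel_s, attribution, actions)
```

For a monotone but non-linear relation such as this one, the rank-based coefficient is close to 1 while Pearson's r is visibly lower, which is why reporting both can be informative.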
As far as the wrong implementation is referred to in lines 326-331 you can try to find that out with xAI methods; similarly for “tuning the reward function” as you rightfully mentioned in line 349.
Thank you for such a valuable remark; we really appreciate it and will consider it in our future work.
Tables 3 and 4 are highly appreciated, please provide some details about if this “makes sense” - are those values expected? Are they in accordance with your domain knowledge (even if it is out of scope to ask several people)? Even more interesting: is there some value in there that surprises you? Please analyze them and compare them a bit.
The interpretation of the table values is now expanded and better signposted in the main text. Additionally, we described an example in which our correlation results for the trained agents directly contradicted our expectations.
Sincerely
Authors
Nikodem Pankiewicz,
Paweł Kowalczyk
Author Response File:  Author Response.pdf