Article
Peer-Review Record

Adjustable and Adaptive Control for an Unstable Mobile Robot Using Imitation Learning with Trajectory Optimization

by Christian Dengler * and Boris Lohmann
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 21 February 2020 / Revised: 16 April 2020 / Accepted: 22 April 2020 / Published: 25 April 2020

Round 1

Reviewer 1 Report

The paper proposes the development of a parametric feedback controller applied to a mobile inverted pendulum. The proposed method uses imitation learning on optimized trajectories.

The principal contribution of the paper is the use of an RNN to reduce the gap between the simulation environment and the real world when transferring the control system. As the authors show throughout the paper, the use of an RNN improves the system's performance in the real-world application.

The paper focuses on facilitating and improving the transfer of the control system from a virtual to a real environment based on offline training and online parameter readjustment. This proposal gives great relevance to the proposed system and makes this article present a reasonable contribution.
However, the system design proposal is not entirely new, since there is a great structural similarity between the system proposed in this paper and the one proposed by Peng et al. in [9]. To improve and clarify the contribution of the paper I advise:
1) Section 3 should have one more paragraph where a deeper comparison is made between the reference [9] and the proposed system in order to contrast the differences and reinforce the contribution, in addition to clarifying how those differences improve the proposed system.

2) Lines 406 and 407 are not clear; perhaps it is a writing problem, or the role of the Raspberry Pi should be clarified better.

3) Before the conclusion section, a paragraph should be included that specifies possible variations to the system and the different results the authors believe such variations would produce, in order to provide the scientific community with a thread to follow in this line of research and strengthen the impact of the proposed system.

4) The paper emphasizes the application of the proposed system to an inverted pendulum type robot, which is fine, but since the aim is to strengthen the paper's contribution on transferring the system from a simulated environment to a real platform, the authors should include a reel of images showing the sequence of operation of the proposed system both in the simulation and on the real platform. This will have more impact on the reader and reinforce the idea of the strength of the system in the area of implementation.

Author Response

We'd like to thank the reviewer for the constructive feedback. Indeed, we agree on the points raised and took action to improve our paper, with a special focus on the presentation of the results. We structure our response by first addressing the enumerated feedback points and then mentioning other noteworthy changes.

1) Section 3 should have one more paragraph where a deeper comparison is made between the reference [9] …
We further clarified the differences from [9] and the advantages of the proposed approach by adding a paragraph at the beginning of Section 3. We also added the reference Paul et al. [35] to underline a disadvantage of [9] that is not present in our approach.

2) Lines 406 and 407 are not clear; perhaps it is a writing problem, or the role of the Raspberry Pi should be clarified better.

We meant to refer to the Raspberry Pi presented in the hardware description in Section 4.1. Since it caused confusion, we decided to simply remove the repeated mention of the Raspberry Pi, as it is not essential to the presentation of the results.

3) It is necessary to include before the conclusion section a paragraph that specifies possible variations …

We included a paragraph between the results and the conclusion. We were unsure which type of variations to the system we were asked to discuss (task/hardware/model) and hope to have complied with the request by adding the paragraph "Outlook of application specific variations", where we briefly describe variations for possible new tasks.

4) The paper emphasizes the application of the proposed system to an inverted pendulum type robot…

We added image sequences of the real mobile inverted pendulum and the simulated inverted pendulum. We also added a video link, where we show the system performance when using the recurrent neural network controller.

5) Further changes
We added an appendix repeating the model equations provided in Pathak et al. [46] in order to facilitate the understanding of the characteristics of the controlled system.
Other minor changes were made to clarify ideas, prevent possible confusion, or explain things in more detail; e.g., the descriptions of figures 5, 6, and 7 were expanded, the cost functions were explained a bit more, etc.

Reviewer 2 Report

The article is interesting and well written on the whole, but some improvements are required to enhance the solidity of the contribution:

- The mathematical model of the MIP is completely omitted, stating that the one developed in [45] has been adopted. Even if known, such a model should be briefly described in the paper, properly citing [45] in which it was derived, otherwise the reader has some difficulty in fully understanding the characteristics of the considered case study, which is the core of the contribution. Also the friction and drive model in (11) seems quite obscure without any other information about the system dynamics.

- No explanation is provided for the cost functions in (13) and (15). How have they been determined? What criteria were adopted to choose their parameters?

- The control performance has been tested considering a sequence of target positions, i.e., in a sequence of regulations to constant desired configurations. For the mobile base of the inverted pendulum this is not very significant in practice, since the robot is expected to move along some predefined trajectory to be followed. Would it be possible to apply the developed control solution to such a case and show the performances that could be achieved? This would significantly strengthen the contribution of the paper.

Author Response

We'd like to thank the reviewer for the constructive feedback. We agree on the points raised and took action to improve our paper, with a special focus on the presentation of the results. Our response to the specific points is given first; then we briefly mention further changes that were made.

- The mathematical model of the MIP is completely omitted, …
It is true that this would help the understanding of the system properties. In order not to interrupt the reading flow, we added an appendix where we provide the lengthy equations and reference them in the "modelling" section.

- No explanation is provided for the cost functions in (13) and (15). How have they been determined? Which are the criteria adopted to choose their parameters?
We added a description of the cost functions and of how we chose the constants for the cost functions (13) and (15); it comes right after the equations.

- The control performance has been tested considering a sequence of target positions, i.e., …
We added a video link that shows such an application. However, since the controller was trained on stationary targets, the MIP keeps a certain distance from moving targets. We clarified this in the new paragraph 5.4, and also note that for closer tracking of moving target positions, data of moving targets must be used.

Further changes
We added image sequences to qualitatively show the control behavior in a more accessible way (figures 8 and 9). The video in the given link also shows our controller for stationary targets.
Other minor changes were made to clarify ideas, prevent possible confusion, or explain things in more detail; e.g., we more explicitly pointed out the differences between our contribution and Peng et al. [9] in section 3, and we reworked the descriptions of figures 5, 6, and 7, etc.

Reviewer 3 Report

The paper robotics-739580 presents a parametric controller for stabilizing a mobile inverted pendulum robot during a motion task. The paper presents various approaches for addressing the problem of dynamically adjusting control parameters during robot motion tasks. A common aspect of all presented control approaches is that they are based on an imitation learning scheme using optimized trajectories as examples.

 

Does the introduction provide sufficient background and include all relevant references?

 

The paper is well introduced by presenting the problem of controlling unstable systems and showing some of the current state-of-the-art approaches in the literature. The authors noted the gap in most current works, in which parametric controllers trained in a simulated environment are transferred to the real world without considering the inaccuracy of the simulated physics. The authors summarize related works into two main groups: robust parametric controllers, and imitation learning with trajectory optimization.

 

Is the research design appropriate?

 

Here starts one of the most serious problems in this work. Although the authors, in the abstract and introduction, aim the work at resolving the problem of controlling a mobile inverted pendulum by proposing an approach, after reading the paper it is not clear which is the genuine method proposed here. The feeling of this reviewer is that this work is just a preliminary evaluation of different supervised learning tools for addressing this problem, but far from having solid results ready for publication. Why do the authors not compare their DOI approach against the DAGGER and DART strategies? (From my point of view, this may be the unique contribution of this work.)

 

Are the methods adequately described?

 

First, the authors added noise to the presentation of this paper, like the whole of section 3.1; this section can be compressed by simply referring readers to the book "J. Nocedal, S. Wright. Numerical Optimization", which offers many more examples of how an optimization problem can be addressed and understood. Secondly, interesting derivations of objective functions are poorly explained or simply not explained, like eqs. 13 and 15: where do all those terms come from? What is the meaning of the constants?

 

On the other hand, there are some incongruences in the formulation proposed in this work (equations and algorithms); for example, in algorithm 1, when is x_1 generated?

 

The derivation of eq. 7 from eq. 6 is not clear.

 

Are the results clearly presented?

 

Although the authors propose various metrics to evaluate the performance of this framework, they are not presented clearly enough for readers to interpret the results. For example, in the results in tables 2 and 3, what do the quantities express? Is a bigger number better? Are the methods really comparable?

 

Figures 5, 6 and 7 are not properly discussed; how are the three methods comparable?

 

 

Are the conclusions supported by the results?

 

Nothing to comment.

Author Response

We'd like to thank the reviewer for the constructive feedback. We took action to improve on the mentioned points, with a special focus on the presentation of the results.

after reading the paper it is not clear which is the genuine method proposed here. …
The method is the three-step method described in section 3, which allows training a recurrent neural network using trajectory optimization. We believe some confusion might have been caused by comparing the final controller with too many intermediate controllers, which is why we clarified which is the main controller at the beginning of section 5.

...Why do the authors not compare their DOI approach against the DAGGER and DART strategies? (From my point of view, this may be the unique contribution of this work.)
It is our belief that this would be an unfair comparison in our favor, as neither DAGGER nor DART is particularly suited to this use case. DOI is not an alternative to DAGGER or DART, but serves a different purpose.
DAGGER aggregates data to avoid too much use of a (human) supervisor. Since we don't use a human supervisor, there is no need to save all past data.
DART is an off-policy method and not suited for training recurrent controllers.
We took the on-policy part of DAGGER with the disturbances from DART specifically for our use case.
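The combination described in this response, on-policy rollouts of the learner as in DAGGER, with disturbances injected into the applied actions as in DART, and labels taken from an oracle, can be sketched roughly as follows. All function names and the toy dynamics are hypothetical illustrations, not the authors' implementation:

```python
import numpy as np

# Rough sketch of the stated idea: roll out the current learner on-policy
# (DAGGER-style), perturb the applied actions (DART-style), and label every
# visited state with the oracle's action as the imitation target.
def collect_doi_batch(policy, oracle, step, x0, horizon, noise_std, rng):
    states, targets = [], []
    x = x0
    for _ in range(horizon):
        states.append(x)
        targets.append(oracle(x))          # supervision from the oracle
        u = policy(x)                      # learner acts on-policy
        u_applied = u + rng.normal(0.0, noise_std, size=np.shape(u))
        x = step(x, u_applied)             # disturbed action drives the system
    return np.array(states), np.array(targets)

# Toy usage: scalar linear system with a linear "oracle" and learner.
rng = np.random.default_rng(0)
S, T = collect_doi_batch(policy=lambda x: -0.3 * x,
                         oracle=lambda x: -0.5 * x,
                         step=lambda x, u: 0.9 * x + u,
                         x0=np.array([1.0]), horizon=10,
                         noise_std=0.05, rng=rng)
```

Because the disturbed actions spread the visited states away from the oracle's nominal trajectory, the collected pairs also teach the learner how to recover, which is the stated motivation for borrowing DART's disturbances.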

First, the authors added noise to the presentation of this paper, like the whole of section 3.1; this section can be compressed by simply referring readers to the book "J. Nocedal, S. Wright. Numerical Optimization" …
It is true that there are many books on optimization, and we are aware of some. However, since trajectory optimization is a crucial part of this paper, we felt the need to do more than just ask the reader to read a book on the topic. We feel that one page is useful to most readers, and we would like to keep it in the paper.

... Secondly, interesting derivations of objective functions are poorly explained or simply not explained, like eqs. 13 and 15: where do all those terms come from? What is the meaning of the constants? …
We added a description of the cost functions and of how we chose the constants for the cost functions (13) and (15); it comes right after the equations.

On the other hand, there are some incongruences in the formulation proposed in this work (equations and algorithms); for example, in algorithm 1, when is x_1 generated?
This was a mistake on our side, and we corrected the algorithm to start with t=0.

The derivation of eq. 7 from eq. 6 is not clear.
It is rather a short notation; there is no derivation between the two. We clarified this in the text between both equations and put u_t, h_t in a vector to be mathematically correct and avoid confusion.

Although the authors propose various metrics to evaluate the performance of this framework, they are not presented clearly enough for readers to interpret the results. For example, in the results in tables 2 and 3, what do the quantities express? Is a bigger number better? Are the methods really comparable?
To reduce possible confusion, we repeated in section 5.2 that the metrics represent costs, and as such smaller numbers are better. We also described the intention behind the cost functions in a little more detail.

Figures 5, 6 and 7 are not properly discussed; how are the three methods comparable?
We now describe the figures in more depth and point out the differences between the controllers more clearly.

Further changes:
Many minor changes were made to clarify ideas, prevent possible confusion, or explain things in more detail.
E.g., we more explicitly pointed out the differences between our contribution and Peng et al. [9] in section 3. We included a new paragraph between the results and the conclusion, "Outlook of application specific variations", where we briefly describe variations for possible new tasks. We added image sequences to qualitatively show the control behavior in a more accessible way (figures 8 and 9). We added a video link where our controller is shown for stationary and moving targets. We added an appendix repeating the model equations given in Pathak et al. [46].

Round 2

Reviewer 2 Report

The authors have addressed all my previous concerns in a satisfying way.

Author Response

We would like to thank the reviewer again for helping to improve our paper. In the new version of the paper, some small changes have been made to clarify possible points of confusion. Also, a mistake in the units in table 1 was corrected.

Reviewer 3 Report

We'd like to thank the reviewer for the constructive feedback. We took action to improve on the mentioned points, with a special focus on the presentation of the results.

… after reading the paper it is not clear which is the genuine method proposed here. …
The method is the three-step method described in section 3, which allows training a recurrent neural network using trajectory optimization. We believe some confusion might have been caused by comparing the final controller with too many intermediate controllers, which is why we clarified which is the main controller at the beginning of section 5.

Are the authors referring to "… Our final recurrent controller of the form in equation (9) is compared with different static controllers and the oracle controllers in terms of different robustness metrics. …"? If that is the case, that is not enough to lead the reader to understand the contribution of this paper. The authors must clarify this: what is the oracle controller for? Is it an alternative approach to the recurrent controller?

 

...Why do the authors not compare their DOI approach against the DAGGER and DART strategies? (From my point of view, this may be the unique contribution of this work.)
It is our belief that this would be an unfair comparison in our favor, as neither DAGGER nor DART is particularly suited to this use case. DOI is not an alternative to DAGGER or DART, but serves a different purpose.
DAGGER aggregates data to avoid too much use of a (human) supervisor. Since we don't use a human supervisor, there is no need to save all past data.
DART is an off-policy method and not suited for training recurrent controllers.
We took the on-policy part of DAGGER with the disturbances from DART specifically for our use case.

Thanks to the authors for clarifying this question. But one question is still open: it is not the first time that an inverted pendulum mobile robot has been presented in the literature. The authors must compare their performance with at least one current state-of-the-art method. And if this method is not comparable with DART and DAGGER, why do the authors make the visual comparison between DART, DAGGER and DOI (figure 1)?

 

First, the authors added noise to the presentation of this paper, like the whole of section 3.1; this section can be compressed by simply referring readers to the book "J. Nocedal, S. Wright. Numerical Optimization" …
It is true that there are many books on optimization, and we are aware of some. However, since trajectory optimization is a crucial part of this paper, we felt the need to do more than just ask the reader to read a book on the topic. We feel that one page is useful to most readers, and we would like to keep it in the paper.

I understand the authors' viewpoint, but this distracts readers from the authors' contribution.

... Secondly, interesting derivations of objective functions are poorly explained or simply not explained, like eqs. 13 and 15: where do all those terms come from? What is the meaning of the constants? …
We added a description of the cost functions and of how we chose the constants for the cost functions (13) and (15); it comes right after the equations.

They are still difficult to understand. What are the constant values in equations (13) and (15) for?

On the other hand, there are some incongruences in the formulation proposed in this work (equations and algorithms); for example, in algorithm 1, when is x_1 generated?
This was a mistake on our side, and we corrected the algorithm to start with t=0.

Thanks for following this detail.

The derivation of eq. 7 from eq. 6 is not clear.
It is rather a short notation; there is no derivation between the two. We clarified this in the text between both equations and put u_t, h_t in a vector to be mathematically correct and avoid confusion.

Thanks for this clarification, but the definition of equation 7 is still not clear. That equation refers to a data-driven method (like a recurrent neural network?); if that is the case, the authors must clarify it.

Although the authors propose various metrics to evaluate the performance of this framework, they are not presented clearly enough for readers to interpret the results. For example, in the results in tables 2 and 3, what do the quantities express? Is a bigger number better? Are the methods really comparable?
To reduce possible confusion, we repeated in section 5.2 that the metrics represent costs, and as such smaller numbers are better. We also described the intention behind the cost functions in a little more detail.

Could the authors explicitly say in which paragraphs that is stated?

Figures 5, 6 and 7 are not properly discussed; how are the three methods comparable?
We now describe the figures in more depth and point out the differences between the controllers more clearly.

Could the authors explicitly say where they are better described? Perhaps the authors should consider changing these figures and putting them in a more comparable format.

Further changes:
Many minor changes were made to clarify ideas, prevent possible confusion, or explain things in more detail.
E.g., we more explicitly pointed out the differences between our contribution and Peng et al. [9] in section 3. We included a new paragraph between the results and the conclusion, "Outlook of application specific variations", where we briefly describe variations for possible new tasks. We added image sequences to qualitatively show the control behavior in a more accessible way (figures 8 and 9). We added a video link where our controller is shown for stationary and moving targets. We added an appendix repeating the model equations given in Pathak et al. [46].

Thanks to the authors for their willingness to improve the current state of the paper.

Author Response

We thank the reviewer again for the feedback. We hope to have tackled the remaining points of confusion. Line numbers refer to the lines of the old version of the paper and might differ from those in the newly uploaded version.

1) ...if that the case, that is not enough for driving the reader to understand the contribution of this paper. The author must clarify this paper because what is oracle controller for? Is an alternative approach to the recurrent controller?

The oracle controller is only an intermediate result that can be used only in simulation or as a teacher for the recurrent controller (not in the application). It is used in DOI to create the target actions. We had written this in lines 237-238, 246 (first sentence of 3.3) and 263, by using the word "intermediate" in lines 182 and 187, and in the headings of table 2.

We now also clarify that it is not an alternative approach, as it cannot be used in the application, by adding a sentence between sections 3 and 3.1 as well as at the beginning of section 5.

2) Thanks to the authors for clarifying this question. But it is still open one question, it is not the first time that an inverted pendulum mobile robot is presented in the literature. The authors must compare their performance with at least a current state method. And if this method is not comparable with DART and DAGGER, why do the authors make the visual comparison between DART, DAGGER and DOI (Figure 1)?

We thought about this; the competing methods we could think of are nonlinear MPC (or even robust nonlinear MPC) and reinforcement learning methods, as those also work with cost functions. We are not aware of any analytical method that can handle abruptly changing target positions. The analytic controllers for the MIP (some are listed in lines 320-325) mostly consist of feedforward plus feedback control and do not use a cost function, which means they would fall short in our comparison even if we managed to make them work, as many have only been tested in simulation.
Nonlinear MPC is unfortunately computationally too expensive to run on our hardware, or it would require someone with more experience than us in programming it efficiently; therefore it is not included.
As for reinforcement learning, we actually tried it before on this system with a different cost function (as we could not use end constraints), and it unfortunately failed to converge.

We believe that a comparison between MPC, reinforcement learning and our approach would be very interesting indeed, but it is currently not possible on our system.

As for figure 1, it is included because some of the ideas are similar, even if the use cases of the methods vary.

3) I understand the viewpoint of the authors, but this distracts readers from the authors contribution.

We understand that the section might be less interesting for readers who are already very familiar with trajectory optimization. However, we believe that people from other fields, e.g., the reinforcement learning community, might be interested in reading the work as well. Also, since the section is self-contained under the heading "3.1 Trajectory Optimization", readers who have the necessary background can easily skip it, as it is not very long. Another reason to keep it is that it introduces the notation (X*, U*, p), etc., that is used later. As such, we would still like to include that section.

4) (about cost functions) They are still difficult to understand. What are the constant values in equations (13) and (15) for?
We extended the explanatory sentence at line 364 to: "The constant coefficients weight the importance of the different control goals against each other and were hand-tuned by trial and error to produce a subjectively appealing behavior of the MIP." Cost function design is, to our knowledge, done either via inverse optimal control / inverse reinforcement learning (quite rare) or, as in our case, by manual tuning (most often).
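As an illustration of this kind of hand-tuned weighting, a minimal sketch follows; the control goals and coefficient values below are hypothetical placeholders for illustration, not the constants from equations (13) and (15):

```python
import numpy as np

# Hypothetical stage cost that weights several control goals against each
# other; the coefficients are illustrative placeholders, tuned by trial and
# error, and are NOT the constants from equations (13) and (15).
W_POS, W_TILT, W_EFFORT = 1.0, 5.0, 0.01   # hand-tuned relative importance

def stage_cost(pos_err, tilt, u):
    # Each quadratic term penalizes one goal; the weights decide the
    # trade-off (here, keeping the pendulum upright dominates position).
    return (W_POS * pos_err ** 2
            + W_TILT * tilt ** 2
            + W_EFFORT * float(np.dot(u, u)))
```

Raising one coefficient relative to the others shifts the optimized trajectories toward that goal, which is the trial-and-error loop described in the response.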

5) Thanks for this clarification, but still not clear the definition of equation 7. That equation is referring to a data-driving method (like a recurrent neuronal network?) if that is the case, the authors must clarify it.

Yes, in our case it is a recurrent neural network. This is written just above equation 6. We now repeat the fact above equation 7 to make this more clear.
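A minimal sketch of a recurrent controller of this kind, where the network maps the current state and the previous hidden state to a stacked output containing the control u_t and the new hidden state h_t; the single tanh cell and all dimensions are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

# Minimal recurrent controller sketch: one tanh cell maps the current state
# x and the previous hidden state h_prev to a stacked output whose first n_u
# entries are the control u_t and whose remainder is the new hidden state
# h_t. Cell type and dimensions are illustrative assumptions.
def rnn_controller(x, h_prev, Wx, Wh, b, n_u):
    out = np.tanh(Wx @ x + Wh @ h_prev + b)
    return out[:n_u], out[n_u:]            # (u_t, h_t)

# Toy usage with random weights.
rng = np.random.default_rng(1)
n_x, n_h, n_u = 4, 3, 2
Wx = rng.normal(size=(n_u + n_h, n_x))
Wh = rng.normal(size=(n_u + n_h, n_h))
b = np.zeros(n_u + n_h)
u, h = rnn_controller(rng.normal(size=n_x), np.zeros(n_h), Wx, Wh, b, n_u)
```

The hidden state carried between time steps is what lets such a controller adapt online, since it can accumulate information about the real system's behavior during operation.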

6) „To reduce possible confusions, we repeated in section 5.2 that the metrics represent costs and as such small numbers are better. We also described the intention behind the cost functions used a little more.“

Could the authors explicitly tell which are the paragraphs where is said that?

It is in lines 402, 403: "A lower value for $J_{\mathbb{E}, c}$ means that the controller is closer to the optimal trajectories."

In lines 404, 405: "Again, smaller values are better."

As for the explanations, there is a sentence either before or after the equation in which each cost is written, e.g., line 403: "The second metric is the highest accumulated costs subtracted from the optimal accumulated costs with initial states and model parameters from the test-set".

7) „We now describe the figures more in depth and point out the differences of the controllers more.“

Could the authors explicitly tell where are they better described? Perhaps the authors must reconsider to change these figures and put them in a more comparable format.

The paragraph in lines 441-456, which was reworked, is dedicated to those figures. The alternative of putting all results in one figure does not appeal to us, as it would make them even harder to see. We evaluated the numerical values in table 3 to have a quantitative measure as well.
While all three controllers are able to solve the task, we believe that by reading the mentioned paragraph and looking closely at the figures, it should be clear that the adjustable controller approaches the target position faster.

8) Further changes

We corrected some of the units in table 1 that were wrong.
