Article
Peer-Review Record

Anticipatory Classifier System with Average Reward Criterion in Discretized Multi-Step Environments

Appl. Sci. 2021, 11(3), 1098; https://doi.org/10.3390/app11031098
by Norbert Kozłowski *,† and Olgierd Unold
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 28 October 2020 / Revised: 12 January 2021 / Accepted: 16 January 2021 / Published: 25 January 2021
(This article belongs to the Special Issue Applied Machine Learning)

Round 1

Reviewer 1 Report

The authors refine the commonly used discounted reward goal in anticipatory learning classifier systems to an average reward goal.

I like how the authors present learning classifier systems in the introduction, making them accessible to a wide range of readers.

I have three areas of critique: i) the selection of test environments and other algorithms for comparison, ii) relatedly, the meaning of the results, and iii) the presentation of the technical part of the paper. I would invite the authors to a major revision in order to resolve these issues.

Let me be concrete.

 

I) On the selection of test environments and other algorithms for comparison

In the introduction the authors write that in many real-world situations, the discounted sum of rewards is not the appropriate option and an average reward criterion should be used, as e.g., in controlling the flow of communication networks.

However, the test environments the authors use are all episodic tasks where the agent tries to reach an end goal. From my perspective, a discounted reward setting makes a lot of sense in those environments. I encourage the authors to elaborate more on the choice of these environments for an undiscounted reward setting. Also, I would welcome a general explanation of why the agents move towards the goal if their inactiveness is not penalized by a discount factor. Further, I would welcome a demonstration of the proposed algorithm on other environments where an undiscounted reward setting might be more appropriate, for example the harvesting of a renewable resource.

Another aspect related to the tests concerns the selection of algorithms the authors compare their method to. It does not become clear to me why the authors have chosen established discounted reward algorithms, like Q-learning (at what discount factor?). Maybe the authors can elaborate on the reasons for their choice. I would encourage the authors to compare their methods to more closely related systems, i.e. learning classifier systems that already use an average reward, such as XCSAR.

 

II) On the meaning of the results

The discussion starts with the statement that anticipatory classifier systems with an averaged reward criterion can be used in multi-step environments. I may misjudge the significance of this result. If I do, I would encourage the authors to elaborate more on why this result is a novel and relevant finding. Further, I would encourage the authors to elaborate more on the specifics of where the average reward criterion may come in useful compared to the more standard discounted case. To put it bluntly, I would be interested in a discussion of why the authors did their work.

 

III) On the presentation of the technical part of the paper.

While I very much like the introduction and the general style of the paper being quite accessible, I would encourage the authors to revise the Materials and Methods part into a concise yet descriptive presentation of their method. For example, is the overall agent-environment system a Markov Decision Process or something different? That would help readers like me with a Reinforcement Learning background a lot.

Furthermore, I encourage the authors to reconsider their use of abbreviations. Generally, I would welcome only a few of them in the paper, only after they have been properly introduced, and none in the title. This helps the readability of the paper enormously. For example, ACS2 in the title (and in the introduction), LCS and AACS2 in the abstract; ALP could be spelled out also after introducing the abbreviation, and references to (Woods1, Maze6, and Woods14) could be given where they are first mentioned.

Specific comments:

Sec 2.1: I would welcome a discussion of how the idea of anticipatory learning differs from temporal-difference learning (like Q-learning).

L. 87: How can a case be useless? No change in perception would also be valuable information.

Fig 2 would benefit greatly from a more elaborate caption that describes all variables in the Figure. What does it mean that the action set is generated?

L. 111: r is typically not the expected reward, but the immediate reward. If the authors mean the expected reward I would welcome a more elaborate discussion.

L. 114: Typo: small t at beginning of the sentence.

Eq. 5: What do the ts in the subscripts vs the brackets mean?

Fig. 3: Where is the terminating state?

L. 218: Why is the observation vector compacted to length 8 if used with the alphabetic coding {0, F, .}?

Fig. 7: ACS2 is not visible. Different line and symbol styles could be used to indicate that it lies behind Q-learning.

Author Response

The authors refine the commonly used discounted reward goal in anticipatory learning classifier systems to an average reward goal.

I like how the authors present learning classifier systems in the introduction, making them accessible to a wide range of readers.

I have three areas of critique: i) the selection of test environments and other algorithms for comparison, ii) relatedly, the meaning of the results, and iii) the presentation of the technical part of the paper. I would invite the authors to a major revision in order to resolve these issues.

Let me be concrete.

 

I) On the selection of test environments and other algorithms for comparison

In the introduction the authors write that in many real-world situations, the discounted sum of rewards is not the appropriate option and an average reward criterion should be used, as e.g., in controlling the flow of communication networks.

However, the test environments the authors use are all episodic tasks where the agent tries to reach an end goal. From my perspective, a discounted reward setting makes a lot of sense in those environments. I encourage the authors to elaborate more on the choice of these environments for an undiscounted reward setting. Also, I would welcome a general explanation of why the agents move towards the goal if their inactiveness is not penalized by a discount factor. Further, I would welcome a demonstration of the proposed algorithm on other environments where an undiscounted reward setting might be more appropriate, for example the harvesting of a renewable resource.

We agree with the statement that discounted algorithms make more sense in the episodic environments used in the paper. However, Mahadevan [1] stated that in RL discounting is mostly useful in cyclic tasks, where the cumulative reward sum can otherwise be unbounded. Besides that, all of these tasks can also be solved using undiscounted algorithms. Such versatility had not been examined in anticipatory classifier systems before.
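To make the distinction concrete, the two criteria can be written as follows (standard textbook definitions in the spirit of Mahadevan [1]; the symbols below are generic and are not the exact notation used in the manuscript):

```latex
% Discounted criterion: expected discounted return (bounded for 0 <= gamma < 1)
V_{\gamma}^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_{0}=s\right]

% Average-reward criterion: expected reward per step (the gain)
\rho^{\pi} = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{n} r_{t}\right]
```

The discounted sum stays bounded even in cyclic tasks thanks to the factor gamma, whereas the average-reward criterion weighs rewards far in the future exactly like immediate ones.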

Three environments (Corridor, FSW, Woods) were chosen mainly because they are standard LCS testbeds. The most advanced environment used in the paper (Woods) allows generalization to take place and can therefore reduce the total number of rules needed to create an optimal policy. AACS2 worked there, although still not as well as the Q-learning and R-learning benchmarks.

To our knowledge, no research effort has yet been undertaken to adapt Anticipatory Classifier Systems for infinite-horizon tasks where there are no absorbing goal states. That might be an area of research on its own, because changes in the algorithm would be required as well.


An explanation of why the agent moves toward the goal at all can be found in Equation 8: the agent is able to find the next best action by using the best classifier fitness from the next match set. We will emphasize this fact in the text.

In his paper [1], Mahadevan used two environments for evaluating average reward RL: a stochastic grid-world domain and a simulated robot environment. The latter is a cyclic testbed and is not suitable for ACS2. The grid environment was extended with a one-way membrane (allowing the agent to pass through it in one direction, but not the other). However, our previous experiments showed that the ACS2 agent is not able to solve grid environments properly: the ALP module is capable of creating classifiers with correct anticipations (latent learning), but the RL module fails at assigning and distributing proper rewards among them. We also agree that an extra, dedicated environment with the features of (1) enabling anticipation of the following state and (2) favoring the average reward criterion would be very useful.

Another aspect related to the tests concerns the selection of algorithms the authors compare their method to. It does not become clear to me why the authors have chosen established discounted reward algorithms, like Q-learning (at what discount factor?). Maybe the authors can elaborate on the reasons for their choice. I would encourage the authors to compare their methods to more closely related systems, i.e. learning classifier systems that already use an average reward, such as XCSAR.

For the benchmarks, we selected both Q-learning and R-learning as baselines for comparison. ACS2 internally uses a Q-learning-like update for its RL component, so the comparison serves two purposes: (1) it validates whether the AACS2 modifications work as expected, and (2) it lets the reader examine the difference between standard, well-known RL algorithms and LCS systems. Using a system like XCSAR would also be interesting, but the results would still not be directly comparable: XCS uses a different classifier structure and different heuristics for creating classifiers. In our opinion, an extension towards other anticipatory classifier systems (like YACS or MACS) would be better suited, but that was marked as future work.
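For illustration, a minimal tabular sketch of the two benchmark update rules is given below (textbook Q-learning and Schwartz-style R-learning; the Python helper names and hyperparameter values are illustrative assumptions and not the code used for the experiments):

```python
from collections import defaultdict

# Both value tables map state -> action -> value; entries default to 0.0.
Q = defaultdict(lambda: defaultdict(float))
R = defaultdict(lambda: defaultdict(float))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Discounted criterion: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def r_learning_update(R, rho, s, a, r, s_next, greedy, alpha=0.1, beta=0.01):
    """Average-reward criterion (Schwartz's R-learning): learn relative values
    R(s, a) together with the average reward rho; rho is only updated after
    greedy (non-exploratory) actions. Returns the updated rho estimate."""
    best_next = max(R[s_next].values(), default=0.0)
    best_here = max(R[s].values(), default=0.0)
    R[s][a] += alpha * (r - rho + best_next - R[s][a])
    if greedy:
        rho += beta * (r - rho + best_next - best_here)
    return rho
```

In the discounted update the effective horizon is controlled by gamma, whereas R-learning replaces discounting with the learned gain rho.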

II) On the meaning of the results

The discussion starts with the statement that anticipatory classifier systems with an averaged reward criterion can be used in multi-step environments. I may misjudge the significance of this result. If I do, I would encourage the authors to elaborate more on why this result is a novel and relevant finding. Further, I would encourage the authors to elaborate more on the specifics of where the average reward criterion may come in useful compared to the more standard discounted case. To put it bluntly, I would be interested in a discussion of why the authors did their work.

The rationale for doing the research was partially presented in the previous section. To summarize, we wanted to show that another reward assignment method (one that can be used more generally) can be adopted in the realm of anticipatory classifier systems.

The initial idea was related to the fact that ACS2 was unable to solve Grid-like environments [2]. The reward estimates for classifiers in a given state were hardly distinguishable, and we thought that introducing a more robust way of distributing reward over long action chains (like the average reward criterion in this case) would help to deal with this issue.

III) On the presentation of the technical part of the paper.

While I very much like the introduction and the general style of the paper being quite accessible, I would encourage the authors to revise the Materials and Methods part into a concise yet descriptive presentation of their method. For example, is the overall agent-environment system a Markov Decision Process or something different? That would help readers like me with a Reinforcement Learning background a lot.

Furthermore, I encourage the authors to reconsider their use of abbreviations. Generally, I would welcome only a few of them in the paper, only after they have been properly introduced, and none in the title. This helps the readability of the paper enormously. For example, ACS2 in the title (and in the introduction), LCS and AACS2 in the abstract; ALP could be spelled out also after introducing the abbreviation, and references to (Woods1, Maze6, and Woods14) could be given where they are first mentioned.

Specific comments:

Sec 2.1: I would welcome a discussion of how the idea of anticipatory learning differs from temporal-difference learning (like Q-learning).

L. 87: How can a case be useless? No change in perception would also be valuable information.

Fig 2 would benefit greatly from a more elaborate caption that describes all variables in the Figure. What does it mean that the action set is generated?

L. 111: r is typically not the expected reward, but the immediate reward. If the authors mean the expected reward I would welcome a more elaborate discussion.

L. 114: Typo: small t at beginning of the sentence.

Eq. 5: What do the ts in the subscripts vs the brackets mean?

Fig. 3: Where is the terminating state?

L. 218: Why is the observation vector compacted to length 8 if used with the alphabetic coding {0, F, .}?

Fig. 7: ACS2 is not visible. Different line and symbol styles could be used to indicate that it lies behind Q-learning.

Comments:

    • Sec 2.1. Difference between ALP and TD. I'm afraid ALP and TD are quite different concepts. The ALP mechanism is based on the psychological framework of "Anticipatory Behavioral Control". It is basically a way of comparing the previous and the current environmental state and, based on that, forming an IF-THEN rule. The reward distribution and assignment happen later, in a different module.
    • L87. Useless case. The useless case (when no change in the environment occurred after executing an action) was introduced in an early version of ACS by Stolzmann; the quality of such a classifier was then decreased. The reason for doing so was that each classifier in the population should be predicting a change. Later, in ACS2, Butz changed this behavior by keeping only the expected and unexpected cases, thereby allowing such classifiers in the population.
    • L111. A classifier cl in ACS2 has two properties, cl.r and cl.ir. The first is the reward estimated in the long run; the second is the immediate reward obtained after executing an action.
    • L114. Fixed typo
    • Eq 5. Right, the subscripts were misleading. We fixed this equation by modifying the equation symbols.
    • Fig 3. In Sec 2.4.1 it is mentioned that the terminating state is s = 1.0. We will add this information to the figure caption as well.
    • L218. The presented Woods environment can return the agent's perception in two ways: the eight surrounding cells can be represented as a string of length 8 (like 'OOO..OOO') or with a binary representation where each symbol is mapped to 2 bits, extending the observation to a string of length 16. The ACS2 algorithm can deal with both representations, but the first is more memory efficient since less data is kept in the classifiers' IF-THEN rules (see the illustrative sketch after this list).
    • Fig 7. Added a caption mentioning that ACS2 and Q-learning behave in the same way.
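To make the two perception representations from the L218 comment concrete, a small illustrative sketch follows (the three symbols and the particular 2-bit codes are assumed for illustration only and are not taken from the actual implementation):

```python
# Hypothetical 2-bit codes for the three Woods perception symbols.
BITS = {'.': '00', 'O': '01', 'F': '10'}

def to_binary(perception: str) -> str:
    """Expand a length-8 symbolic perception into a length-16 bit string."""
    return ''.join(BITS[symbol] for symbol in perception)

symbolic = 'OOO..OOO'                        # eight surrounding cells, one symbol each
print(symbolic, '->', to_binary(symbolic))   # OOO..OOO -> 0101010000010101
```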

We've listed all major changes made (also according to your comments).

Changes:

    • emphasized that the next steps are chosen mainly by using the best fitness from the next match set
    • explicitly stated that all tested environments are Markovian
    • more descriptive captions for Figures 2, 3, and 7
    • more readable Equation 5

References:

    1. Mahadevan, Sridhar. "Average reward reinforcement learning: Foundations, algorithms, and empirical results." Machine learning 22.1-3 (1996): 159-195.
    2. Kozlowski, Norbert, and Olgierd Unold. "Investigating exploration techniques for ACS in discretized real-valued environments." Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion. 2020.

 

Reviewer 2 Report

The paper addresses real-life decision-making problems that aim at maximizing the average of successive rewards. The authors consider anticipatory classifier systems, which are a branch of the classical learning classifier systems. They introduce the average reward criterion in order to learn a predictive schema model of a given environment. The manuscript is well structured, provides a comprehensive introduction, analyzes the proposed solution in detail, and verifies it using diverse multi-step scenarios.

First, anticipatory learning classifier systems were introduced, and reinforcement learning and various reward criteria were described in detail. Next, the proposed solution was formulated and examined using three diverse sequential environments of increasing difficulty. The numerical studies verified the effectiveness of AACS2, and the correlations among the system parameters were noted and discussed. Hence, the work provides an interesting and valuable contribution.

I appreciate the authors' contribution to the area of learning classifier systems; however, I believe a few improvements are necessary in order to provide a high-quality paper suited for this kind of journal:

- (major) The performed literature review provides a clear introduction to the considered research area, however, the latest publications related to the learning classifier systems should definitely be mentioned. Now, only 20% of all the cited papers were published in the last 10 years.
- (minor) Figures 6, 8, 10 and 11 may be divided into sub-figures (a), (b), etc., which would allow them to be more precisely described in the captions.

Author Response

We additionally mentioned recent advancements in the LCS area such as:

 

  • latest ACS2 extensions (BACS, PEPACS),
  • variants for handling large amounts of data (BioHEL, ExSTraCS 2.0),
  • fusion with neural networks,
  • yearly IWLCS research overview

 

We also considered splitting the mentioned plots into sub-figures, but ultimately concluded that, because some elements (like the legend and the abscissa) are shared, it is more convenient for the reader to keep them compressed into single figures.

Reviewer 3 Report

This paper introduces methods that improve ACS2 by changing the update rule, motivated by R-learning (AACS2-v1) and XCSAR (AACS2-v2). The proposed methods have a payoff landscape similar to R-learning, and in the Woods1 environment both AACS2-v1 and AACS2-v2 found a better behavior policy than ACS2.

However, in simpler environments such as Corridor-20 and FSW-10, the policy from AACS2 showed similar performance to ACS2. Additional experiments in more complex environments would show the superiority of AACS2.

The steps-in-trial graphs (Figures 6, 8, and 10) do not include Q-learning, while the payoff landscapes (Figures 7, 9, and 12) include a comparison with Q-learning and R-learning. Adding a comparison with Q-learning and R-learning to Figures 6, 8, and 10 would improve the quality of the paper.

Other suggestions:

In Section 2.3, some variables use two or more different notations, e.g. r(t+1) vs. r_{t+1}(s), cl.r vs. cl_r, and cl.ir vs. cl_{ir}. The notations are confusing, especially cl.r_{t+1}(t) in Equation (5). The variable cl.q (the classifier's quality) needs more explanation.

Please add an explanation of why the payoff landscape graphs (Figures 7 and 9) show a step-like plot.

Author Response

Below are the responses to review suggestions.

 

The plots showing the number of steps in a trial and the estimated average (Figures 6, 8, and 10) do not include Q-learning and R-learning because they were trained differently than the anticipatory classifier systems (ACS2 and AACS2). In Q-learning and R-learning, the decision to exploit is made at each step using the epsilon parameter, but in ACS2/AACS2 the explore and exploit phases alternate in consecutive runs. Therefore, a change in the testing policy would be required to estimate the number of steps in each exploit phase, and the estimated average variable is obviously not available in Q-learning. Our idea was to compare the AACS2 modifications with traditional algorithms mainly in terms of generated rewards (the payoff landscape), and not necessarily in terms of overall exploitation performance (which, of course, can also be done).
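To illustrate the difference between the two training regimes, a schematic sketch is given below (assuming per-step epsilon-greedy action selection for Q-/R-learning and alternation at the level of whole trials for ACS2/AACS2; the helper functions and values are illustrative assumptions, not the experiment code):

```python
import random

def epsilon_greedy_action(q_values, actions, epsilon=0.1):
    """Q-/R-learning style: the explore/exploit decision is made at every step."""
    if random.random() < epsilon:
        return random.choice(actions)                        # exploratory step
    return max(actions, key=lambda a: q_values.get(a, 0.0))  # greedy step

def trial_modes(n_trials):
    """ACS2/AACS2 style: whole trials alternate between exploration and exploitation."""
    return ['explore' if t % 2 == 0 else 'exploit' for t in range(n_trials)]

print(trial_modes(6))                            # ['explore', 'exploit', 'explore', ...]
print(epsilon_greedy_action({0: 1.0}, [0, 1]))   # usually returns 0 (the greedy action)
```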

 

The notation used in this paper mixes conventions from reinforcement learning and learning classifier systems. Because both the bucket-brigade and Q-learning algorithms in ACS use classifiers from the previous and the current time step, this modified notation is used (Section 2.3 describes how to interpret it).

 

The step-like shape of the pay-off plots is mentioned in the Discussion section (line 296); however, we will also expand it with a more detailed explanation.

 

Changes:

  • Unified notation of cl.r and cl.ir (Section 2.3)
  • Extended Discussion by mentioning step-like pay-off plots (Section 4)

Round 2

Reviewer 1 Report

The authors addressed all my comments.

Author Response

Thank you for the review.

Reviewer 3 Report

I asked the authors to make some corrections. However, there are two parts that have not been modified. I think the following must be revised.

1. I would like to see a comparison of the performance of R-learning and AACS2, even if it is a rough comparison.
2. In line 169, cl_r and cl_ir need to be corrected to cl.r and cl.ir.

 

Author Response

We've run all the experiments additionally using both the Q-Learning and R-Learning algorithms. Figures 6, 8, and 10 compare all the approaches in a consistent way.

We've also fixed the second issue (regarding the notation) and some minor details (like a brief explanation of AXCS and XCSG modifications in the Introduction).

Round 3

Reviewer 3 Report

I believe that the paper is ready for publication
