This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
As a powerful and intelligent machine learning method, reinforcement learning (RL) has been widely used in many fields such as game theory, adaptive control, multi-agent systems, nonlinear forecasting, and so on. The main contribution of this technique is its exploration and exploitation approach to finding optimal or semi-optimal solutions of goal-directed problems. However, when RL is applied to multi-agent systems (MASs), problems such as the "curse of dimensionality", the "perceptual aliasing problem", and the uncertainty of the environment constitute high hurdles to RL. Meanwhile, although RL is inspired by behavioral psychology and uses reward/punishment from the environment, higher mental factors such as affects, emotions, and motivations are rarely adopted in its learning procedure. In this paper, to support agent learning in MASs, we propose a computational motivation function, which adopts two principal affective factors, "Arousal" and "Pleasure", of Russell's circumplex model of affect, to improve the learning performance of a conventional RL algorithm named Q-learning (QL). Computer simulations of pursuit problems with static and dynamic preys were carried out, and the results showed that, compared with the conventional QL, the proposed method gives agents a faster and more stable learning performance.
Keywords: multi-agent system (MAS); computational motivation function; circumplex model of affect; pursuit problem; reinforcement learning (RL)

1. Introduction
"Reinforcement" was first used by Pavlov in his famous conditioned reflex theory in the 1920s. It explains that stimuli from the external environment can be categorized as rewards or punishments, which change the nature of the brain and the behavior of animals. The concept of reinforcement was introduced into artificial intelligence (AI) in the 1950s and, as a bio-inspired machine learning method, reinforcement learning (RL) has developed rapidly since the 1980s [1]. As analyzed by Doya, RL may take place in the basal ganglia of the brain; even the parameters used in RL may involve neuromodulators such as dopamine, serotonin, noradrenaline, and acetylcholine [2]. In recent years, RL has been widely used in game theory [1], autonomous robotics [3,4], intelligent control [5,6], nonlinear forecasting [7,8,9,10], multi-agent systems (MASs) [11,12,13,14,15,16,17,18], and so on.
In [3], Asada et al. proposed a vision-based RL method to acquire cooperative behaviors of mobile robots in dynamically changing real worlds. In [4], Kollar and Roy combined RL with the extended Kalman filter (EKF) to realize trajectory optimization of autonomous mobile robots in an unknown environment. Jouffe proposed a fuzzy inference system with RL to solve nonlinear control problems in [5]; meanwhile, Obayashi et al. realized robust RL for control systems adopting the sliding mode control concept in [6]. In our previous works, several kinds of neuro-fuzzy-network-type RL systems have been proposed as time series predictors [7,8,9,10] and as the internal models of agents to solve goal-directed exploration problems in unknown environments [11,12,13,14]. Kobayashi et al. adopted an attention degree into an RL method named Q-learning (QL) for multi-agent systems (MASs) [15,16,17] to acquire adaptive cooperative behaviors of agents, and confirmed the effectiveness of their improved RL by simulations of the pursuit problem (hunting game) [18].
The principle of RL can be summarized as using adaptive state-action pairs to realize the optimal state transition process, where the optimal solution means that the maximum rewards are obtained at the minimum cost. The rewards from the environment are converted into the values of states, actions, or state-action pairs in RL. Almost all well-known RL methods [1,19,20,21,22] use value functions to change the states of the learner to find the optimized state transitions. However, when the learner (agent) observes the state of the environment only partially, or the state of the environment is uncertain, it is difficult to select adaptive actions (behaviors). For example, in a multi-agent system (MAS), i.e., multiple agents exploring an unknown environment, the neighborhood information is dynamic and action decisions need to be made dynamically; the autonomy of the agents makes the state of the environment uncertain and not completely observable [11,12,13,14,15,16,17,18]. When RL is applied to MASs, problems such as the "curse of dimensionality" (the explosion of the state-action space), the "perceptual aliasing problem" (such as the state transition in a partially observable Markov decision process (POMDP)), and the uncertainty of the environment become high hurdles.
The action selection policy in RL plays the role of the learner's motivation. The learning process of RL is to find the optimal policy to decide valuable actions during the transition of states. The behavior decision process of RL is clear and logical, based on reward/punishment prediction. Meanwhile, to decide an action/behavior, higher animals, especially human beings, may use not only the thinking brain, i.e., logical judgment, but also the emotional brain, i.e., instinctive response. Recent neuroscience suggests that there are two paths for producing emotion: a "low road" from the thalamus to the amygdala and the body, which is twice as fast as the "high road" that carries the same emotional information from the thalamus to the neocortex [23,24]. So it is possible for our brains to register the emotional meaning of a stimulus before that stimulus has been fully processed by the perceptual system, and the initial precognitive, perceptual, emotional processing of the low road is fundamentally highly adaptive, because it allows people to respond quickly to important events before complex and time-consuming processing has taken place [23].
So, in contrast with value-based RL, behavior acquisition for autonomous mobile robots has also recently been approached with computational emotion models. For example, Ide and Nozawa's groups proposed a series of emotion-to-behavior rules for goal exploration in unknown environments by plural mobile robots [25,26]. They used a simplified circumplex model of emotion given by Larsen and Diener [27], which comes from the circumplex model of affect by Russell [28]. Russell's affect model [28] was obtained by statistical categorization of psychological (self-report) experiments, and it suggests eight main affects, named "Arousal", "Excitement", "Pleasure", "Contentment", "Sleepiness", "Depression", "Misery", and "Distress", arranged on a circular map. These affect factors were abstracted into a space of two basic dimensions, with "Pleasant-Unpleasant" (valence dimension) and "High Activation-Low Activation" (arousal dimension) axes, by Larsen and Diener [27]. Using these emotional factors, Ide and Nozawa introduced a series of rules to drive robots to pull or push each other so as to avoid obstacles and cooperatively find a goal in an unknown environment [25,26]. To overcome problems such as deadlock and multiple-goal exploration in complex environments, Kuremoto et al. improved the emotion-to-behavior rules by adding another psychological factor, "curiosity", in [29,30]. However, all these emotional behavior rules for mobile robots only drive the robots to move towards static goals. Moreover, a learning function, which is an important characteristic of an intelligent agent, is not equipped in these models. So agents can explore the goal(s), but cannot find the optimal route to the goal(s).
In this paper, we propose adopting affect factors into conventional RL to improve its learning performance in MASs. We suggest that the fundamental affect factors "Arousal" and "Pleasure" be multiplied to produce an emotion function, and that the emotion function be linearly combined with the Q function, the state-action value function of Q-learning (QL), a well-known RL method, to compose a motivation function. The motivation function is adopted into the stochastic policy function of RL instead of the Q function of QL. Agents select available behaviors not only according to the states they observe from the environment, but also by referring to their internal affective responses. The emotion function is designed by calculating the distance from the agent to the goal and the distances between the agent and the other agents perceived in its field of view. So cooperative behaviors may be generated during the goal exploration of plural agents.
The rest of this paper is structured as follows. In Section 2, Russell's affect model is briefly introduced first; then a computational emotion function is described in detail. Combining the emotion function with the well-known reinforcement learning method Q-learning (QL), a motivation function is constructed and used as the action selection policy of the improved RL of this paper. In Section 3, to confirm the effectiveness of the proposed method, we apply it to two kinds of pursuit problems, a simulation with a static prey and a simulation with a dynamic prey, and the results of the simulations are reported. Discussions and conclusions are stated in Sections 4 and 5, respectively.
2. An Improved Q-Learning

2.1. Russell's Affect Model
Though various descriptions and models of human emotion have been given by psychologists and behaviorists, circumplex models are famous and widely studied [31,32,33]. Wundt proposed a 3D emotion model with dimensions of "pleasure-displeasure", "arousal-calmness", and "tension-relaxation" 100 years ago [33]. Psychologists recently distinguish "emotion" from "affect": "emotions" are "valenced reactions to events, agents, or objects, with their particular nature being determined by the way in which the eliciting situation is construed", whereas "affect" means evaluative reactions to situations as good or bad [34]. In this study, we adopt the basic affect dimensions of the circumplex model to invoke the emotional responses for the decisions of autonomous agents.
Russell's circumplex model of affect [28] (see Figure 1) is simple and clear. It was obtained by a statistical method (principal-components analysis) categorizing human affects using 343 subjects' self-report judgments. In the circumplex model, eight kinds of affective states, including pleasure (0°), excitement (45°), arousal (90°), distress (135°), displeasure (180°), depression (225°), sleepiness (270°), and relaxation (315°), are located in a circular order. The interrelationships of these states are also represented by the arrangement; e.g., high values of arousal and pleasure result in a high value of excitement.
A circumplex model of affect proposed by Russell [27,28].
2.2. Emotion Function
In contrast with using the value function to decide an action/behavior of agents in conventional reinforcement learning methods [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], we propose using affective factors to invoke emotional response and reaction during the state transition process of agents in multi-agent systems (MASs). The reason for this proposal is the principle of interacting brain systems: the emotional brain and the thinking brain make decisions cooperatively [23,24]. Here, we use "Arousal" and "Pleasure" from Russell's basic affects to construct the "eliciting-situation-oriented emotion" of agents. Moreover, the goal-directed problems to be solved by RL are limited to pursuit problems or unknown-environment explorations using locally observable information.
Now, suppose an eliciting situation for an agent i, i ∈ {1,2,…,N}, is e, e ∈ E ≡ {1,2,…,E}, and the corresponding emotion is Emo_{i}(e). When the agent perceives objects with high rewards (e.g., a goal) in the situation e, then Emo_{i}(e) = toPv_{i}(e) (see Equation (1)), where toPv_{i}(e) (see Equation (2)) is a multiplication of Arousal av_{i}(e) (see Equation (4)) and Pleasure pv_{i}(e) (see Equation (5)). Meanwhile, if the agent finds other agents in the situation e, the emotion of the agent is totally affected by the emotion of the other (perceived) agents: Emo_{i}(e) = getPv_{i}(e) (see Equation (1)). Consequently, the emotion function of the agent i in the situation e is given by Equations (1)-(5).
where d_{goal}(e) is the Euclidean distance from the agent to the goal (in the environment exploration problem) or to the prey (in the pursuit problem) if the goal is perceivable, and d_{agent}(e) denotes the Euclidean distance between agent i and agent j if agent j is perceivable in situation e. Depth is the threshold value of the perceivable depth of space for sensory stimuli of visual, auditory, or olfactory signals; Δav, Pv, and σ are positive parameters; e_{i} and e_{i}' are the current and next eliciting situations, respectively.
So the formulas designed above can be incorporated into the following inference rules:
Rule A: Pleasure pv_{i}(e) concerns the distance between the agent i, i ∈ {1,2,…,N}, and the goal (Equation (5)) if the situation e means that the goal is perceived.
Rule B: Arousal av_{i}(e) increases while the eliciting situation continues (Equation (4)).
Rule C: The emotion state Emo_{i}(e) (Equations (1)-(3)) of agent i in the eliciting situation e is expressed by the multiplication of Pleasure pv_{i}(e) or pv_{j}(e) (Equation (5)) and Arousal av_{i}(e) or av_{j}(e) (Equation (4)).
Rule D: The emotion state Emo_{i}(e) is constructed and changed by stimuli from objects and events, i.e., by dynamically perceiving the goal or other agents (Equations (1)-(5)).
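The rules above can be sketched in code. This is a minimal illustration, not the paper's exact implementation: the Pleasure function is assumed to be a Gaussian of the distance to the goal (σ is the "constant of the Gaussian function" in Table 3), Arousal is assumed to grow by Δav while the same eliciting situation persists, and the parameter values follow Table 3.

```python
import math

# Assumed parameter values, taken from Table 3 of this paper.
PV, SIGMA, DELTA_AV, DEPTH = 200.0, 8.0, 0.01, 4

def pleasure(d_goal: float) -> float:
    """Pleasure pv(e): large when the perceived goal is close (Rule A).
    The Gaussian form is an assumption consistent with sigma in Table 3."""
    return PV * math.exp(-(d_goal ** 2) / (SIGMA ** 2))

def arousal(av_prev: float, same_situation: bool) -> float:
    """Arousal av(e): accumulates while the eliciting situation continues (Rule B)."""
    return av_prev + DELTA_AV if same_situation else av_prev

def emotion(av: float, d_goal=None, other_emotion=None) -> float:
    """Emo(e): Arousal x Pleasure if the goal is perceivable within Depth
    (Rule C); otherwise adopt the emotion of a perceived neighbor agent
    (Rule D); else no emotional response."""
    if d_goal is not None and d_goal <= DEPTH:
        return arousal(av, True) * pleasure(d_goal)
    if other_emotion is not None:
        return other_emotion  # emotion taken over from a perceived agent j
    return 0.0
```

As a quick check, a goal at distance 0 should elicit a stronger emotion than one at the perception limit, and an agent that only sees a neighbor should simply inherit that neighbor's emotion.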
2.3. A Motivation Function
When the Q-learning algorithm (QL) [1,21] of conventional reinforcement learning (RL) is used to find the optimal solution of a Markov decision process (MDP), which describes the stochastic state transition of agents, a state-action value function Q_{i}(s,a) is calculated as follows:
where s ∈ R^{N×E} is a state observed by the agent i, a ∈ R^{E} is an action selected by the agent according to the value of Q_{i}(s,a) (a higher Q_{i}(s,a) yields a higher probability of action a at the state s), s' and a' are the next state and a selectable action there, r is a scalar reward from the environment, 0 < α < 1 is a learning rate, and 0 < γ < 1 is a discount constant.
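The standard one-step Q-learning update of Equation (6) can be written compactly; here is a small sketch using a dictionary-backed Q table, with α and γ set to the values in Table 2:

```python
# One-step Q-learning update (Equation (6)).
# alpha (learning rate) and gamma (discount) follow Table 2 of this paper.
ALPHA, GAMMA = 0.9, 0.9

def q_update(Q, s, a, r, s_next, actions):
    """In-place update of a dict-backed Q table; returns the new Q(s, a).
    Unvisited state-action pairs default to 0.0, matching the
    initialization in Section 2.5."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + GAMMA * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * td_error
    return Q[(s, a)]
```

For example, a first visit that earns reward 10.0 moves Q(s, a) from 0 to 0.9 × 10.0 = 9.0.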
QL, which assumes an MDP, defined as a transition of states that depends only on the current state and the selected action [1], shows good convergence for problems in MDPs. However, in many real-world cases, the Markov assumption is not satisfied. For example, when a learner observes its local environment, or observes the state indirectly, different actions need to be selected for the optimal state transition. This case is called a partially observable Markov decision process (POMDP) [1]. In MASs, an agent even needs to decide its optimal action according to the decisions of other agents, so the problems of MASs are non-Markovian in nature. To improve QL for POMDPs and non-Markovian problems, approaches that make the environment MDP-like have been proposed, such as using belief states built from the state transition history [1,21], Monte Carlo policy evaluation [35], and different rewards for the agents [36]. Furthermore, additional information, such as the relationship between agents, was adopted into the state-action value function in [18].
Here, for agent learning in a MAS, we propose that action selection be decided not only by the value of the observed state of the environment, i.e., Equation (6), but also by the situation-oriented emotion of the agent given by Equations (1)-(5). A motivation function of agent i expressing a state-emotion-action value is proposed as follows.
where 0 ≤ L ≤ 1.0 is a constant to adjust the balance between the emotional response Emo_{i}(e) and the knowledge exploitation Q_{i}(s,a). Specifically, the action a, a ∈ A ≡ {1,2,…,A}, denotes the action of agent i in each available situation e; A = E here.
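A sketch of the motivation function of Equation (7) follows. The precise linear combination is an assumption made for illustration: the emotion and the Q-value are blended as a convex combination weighted by the balance coefficient L (L = 0.5 in Table 3); the paper states only that the combination is linear and balanced by L.

```python
# Assumed convex-combination form of Equation (7); L follows Table 3.
L_BAL = 0.5

def motivation(q_value: float, emo: float, l_bal: float = L_BAL) -> float:
    """State-emotion-action value M(s, Emo(e), a): a blend of knowledge
    exploitation (Q-value) and emotional response (Emo)."""
    return (1.0 - l_bal) * q_value + l_bal * emo
```

With l_bal = 0 the agent falls back to pure Q-learning; with l_bal = 1 it acts on emotion alone, which mirrors the role of L described above.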
2.4. Policy to Select an Action
As in traditional QL, we use the softmax selection policy to decide an action according to the Boltzmann (Gibbs) distribution [1].
where p_{i}(a|s, Emo(e)) is the probability of selecting action a at state s and emotion Emo(e) of an eliciting situation e. T is a positive constant called the "temperature" in the Boltzmann distribution function. A higher T encourages the agent to select an available action more randomly. Here, we suggest using T ← T^{Episode}, 0 < T < 1, Episode = 1, 2, …, to reduce T as the learning iteration (episode) increases, so as to obtain stable learning convergence.
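The Boltzmann selection of Equation (8), together with the suggested temperature annealing, can be sketched as follows (the max-subtraction is only a standard numerical-stability trick, not part of the paper):

```python
import math

def softmax_policy(m_values, temperature):
    """Return selection probabilities p(a|s, Emo(e)) over motivation
    values, following the Boltzmann (Gibbs) distribution (Equation (8))."""
    # subtract the max before exponentiating for numerical stability
    m_max = max(m_values)
    exps = [math.exp((m - m_max) / temperature) for m in m_values]
    z = sum(exps)
    return [e / z for e in exps]

def annealed_temperature(t0: float, episode: int) -> float:
    """T <- T0**Episode with 0 < T0 < 1: T shrinks as learning proceeds,
    making action selection progressively greedier."""
    return t0 ** episode
```

For instance, with motivation values [1, 2, 3], the highest-valued action gets the largest probability, and its probability grows as T decreases over episodes.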
2.5. Learning Algorithm
The improved QL algorithm then can be given as follows:
Initialize Q_{i}(s,a) = 0 and av_{i} = 0 for each agent i, i ∈ {1,2,…,N}, all observable states s ∈ S, and all actions.
Repeat following in each episode.
Return to the initial state.
Repeat the following until the final state.
Observe the state s_{i} ∈ S of the environment; judge the situation e_{i} ∈ E if the goal or other agents appear in the perceivable area; calculate the emotion Emo_{i}(e) and the motivation M_{i}(s, Emo_{i}(e), a) at the state s_{i} and situation e_{i} for each selectable action; select an action a_{i} ∈ A according to the stochastic policy (Equation (8)).
Execute the action a_{i}; observe a reward r and next state s_{i}’.
Renew Q_{i}(s,a) according to Equation (6).
s_{i} ← s_{i}’
Stop if episodes are repeated enough.
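The steps above can be assembled into a compact, self-contained sketch of one learning episode. Everything about the environment interface here (`reset`, `situation_emotion`, `step`, `actions`) is hypothetical scaffolding, not from the paper; α, γ, and L follow Tables 2 and 3, and the motivation blend of Equation (7) is assumed to be a convex combination of Q-value and emotion.

```python
import math
import random

ALPHA, GAMMA, L_BAL = 0.9, 0.9, 0.5  # Tables 2 and 3 of this paper

def run_episode(env, Q, temperature, max_steps=10000):
    """One episode of the improved QL (Section 2.5) on a hypothetical env."""
    s = env.reset()
    for step in range(1, max_steps + 1):
        emo = env.situation_emotion(s)        # Emo_i(e), Equations (1)-(5)
        # motivation M(s, Emo(e), a) for each selectable action (Eq. (7),
        # assumed convex-combination form)
        m = [(1 - L_BAL) * Q.get((s, a), 0.0) + L_BAL * emo[a]
             for a in env.actions]
        # Boltzmann selection (Eq. (8)); subtract max for numerical stability
        m_max = max(m)
        weights = [math.exp((v - m_max) / temperature) for v in m]
        a = random.choices(env.actions, weights=weights)[0]
        r, s_next, done = env.step(a)
        # Q-learning update (Eq. (6))
        best = max(Q.get((s_next, b), 0.0) for b in env.actions)
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + ALPHA * (r + GAMMA * best - q_sa)
        s = s_next
        if done:
            break
    return step
```

Running this over many episodes, with the temperature annealed as T ← T^{Episode}, would reproduce the overall algorithm; the Q table passed in is shared across episodes.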
3. Computer Simulation

3.1. Definition of Pursuit Problem
To confirm the effectiveness of the proposed method, computer simulations of pursuit problems were executed. In Figure 2, a case of the 2-hunter-1-static-prey problem with the initial state of the exploration environment is shown. The size of the environment is 17 × 17 grids. The prey (●) stays at the position (9, 9) at all times, while two hunters (○) find and capture the prey starting from (2, 2) and (14, 14), respectively. The perceivable area of a hunter is a square field around the hunter, as shown in Figure 2, and the depth limitation was set to 4 grids in the 4 directions. Since only locally observable information was used rather than global coordinates, the environment belongs to a POMDP.
The pursuit problem, or hunting game, can be designed under various conditions [18]; for convenience, it is defined here with a static or dynamic goal (prey) that needs to be captured by all the agents (hunters) simultaneously in one of the final states, as shown in Figure 3.
A hunter, which is an autonomous agent, observes its local environment as a state in five dimensions: Wall or Passage in four directions (up, down, left, and right) and Prey information in one dimension. Each of the first four dimensions takes three discrete values: near, far, and unperceivable. The Prey dimension takes five values: up, down, left, right, and unperceivable. So the number of states s_{i} observed by a hunter i is counted as 4 × 3 × 5 = 60. The number of actions is designed as A = 4 (up, down, left, right), and the number of state-action values Q_{i}(s,a) (the Q table of QL) is 60 × 4 = 240. Additionally, the emotional responses Emo_{i}(e) to the eliciting situations, i.e., the perception of the prey or other agents, are also in four directions, as for the actions (up, down, left, right; E = 4). So the number of motivation function values M_{i}(s, Emo_{i}(e), a) is 240 × 4 = 960.
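The table-size bookkeeping above can be checked with a few lines of arithmetic, following the counting convention used in the text:

```python
# State/action bookkeeping for the pursuit problem, as counted in the text:
# 4 Wall/Passage dimensions with 3 values each, one Prey dimension with
# 5 values, 4 actions, and 4 eliciting situations (E = A = 4).
WALL_DIRS, WALL_VALUES, PREY_VALUES = 4, 3, 5
N_ACTIONS, N_SITUATIONS = 4, 4

n_states = WALL_DIRS * WALL_VALUES * PREY_VALUES   # 4 x 3 x 5 = 60 states
q_table_size = n_states * N_ACTIONS                # 60 x 4 = 240 Q(s,a) entries
motivation_size = q_table_size * N_SITUATIONS      # 240 x 4 = 960 M values
```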
An environment for the 2-hunter-1-static-prey problem.
Available final states of a 2-hunter-1-static-prey problem.
The learning performance of the Q-learning method and the proposed method were compared for the pursuit problem in simulation. Part of the settings and parameters of the simulations are listed in Tables 1 and 2. Special parameters used in the proposed method are listed in Table 3.
Table 1. Common conditions in the simulations.

Description | Symbol | Value
Episode limitation | Episode | 200 times
Exploration limitation in one episode | step | 10,000 steps
Size of the environment | X × Y | 17 × 17 grids
Threshold (depth) of perceivable field | Depth | 4 grids
Number of hunters | i | 2, 3, 4
Number of actions/situations | a/e | 4
Table 2. Parameters used in the simulation of the 2-hunter-1-static-prey problem.

Parameter | Symbol | Q learning | Proposed method
Learning rate | α | 0.9 | 0.9
Damping constant | γ | 0.9 | 0.9
Temperature (initial value) | T | 0.99 | 0.994
Reward of prey captured by 2 hunters | r_{1} | 10.0 | 10.0
Reward of prey captured by 1 hunter | r_{2} | 1.0 | 1.0
Reward of one step movement | r_{3} | −0.1 | −0.1
Reward of wall crash | r_{4} | −1.0 | −1.0
Table 3. Parameters of the proposed method used in the simulations.

Parameter | Symbol | Value
Coefficient of Emo | L | 0.5
Coefficient of Pleasure | Pv | 200.0
Initial value of Arousal | Av | 1.0
Modification of Arousal | Δav | 0.01
Constant of Gaussian function | σ | 8.0
3.2. Results of Simulation with a Static Prey
Both QL and the proposed method achieved the final states when the learning process converged in the simulation. Figure 4 shows the different tracks of the two hunters given by conventional QL (Figure 4(a)) and the proposed method (Figure 4(b)), where "S" and "○" denote the start positions of the hunters and "●" is the position of the static prey. The convergence of the proposed method was faster and better than that of Q-learning, as shown in Figure 5, where "Q" indicates the conventional QL results and "M" those of the proposed method using the motivation function.
Comparison of the different tracks of two hunters capturing a static prey. (a) Results of Q-learning (QL); (b) Results of the proposed method.
Comparison of the change of exploration costs (the number of steps from start to final states) during learning process in 2hunter1staticprey problem simulation (average of 10 simulations).
The average number of steps per episode (from start to final states) over the whole learning process (200 episodes), averaged over 10 simulations, was 72.9 for the conventional QL and 17.1 for the proposed method, as shown in the first row of Table 4. Similar simulations using three and four hunters to find/capture a static prey were also performed, and the learning performance of the different methods is compared in Figure 6 and Table 4. The proposed method was confirmed to be superior in terms of learning performance, i.e., it needs fewer steps to capture the prey in all simulations.
Table 4. Average exploration steps during learning in static prey problem simulations (200 episodes, 10 trials).

Number of hunters | Q learning | Proposed method
2 | 72.9 | 17.1
3 | 41.1 | 15.0
4 | 50.7 | 18.6
Comparison of average exploration steps during learning in static prey problem simulations (200 episodes, 10 trials).
3.3. Results of Simulation with a Dynamic Prey
When the prey is not static, i.e., it moves during the exploration process, learning the hunters' adaptive actions becomes more difficult. We designed a dynamic prey simulation with a periodic movement process of the prey. The simulation started from the same initial state, as shown in Figure 3. After the start, the prey moved one step to the left at a time until it arrived at the left wall; then it returned to the right until it reached the right wall, repeating this left-right movement periodically.
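The periodic prey motion described above can be sketched as a simple bounce rule. The column range 1..17 is an assumption matching the 17 × 17 grid:

```python
# Periodic left-right prey motion: walk one step per time step, reversing
# direction at the walls. Columns 1..17 are assumed for the 17 x 17 grid.
def prey_step(x: int, direction: int, x_min: int = 1, x_max: int = 17):
    """Return the prey's next column and direction (-1 = left, +1 = right)."""
    if x + direction < x_min or x + direction > x_max:
        direction = -direction  # bounce off the wall
    return x + direction, direction
```

Iterating this rule from any starting column produces the back-and-forth sweep that the hunters must learn to intercept.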
Table 5 gives the parameters used in the simulation of the 2-hunter-1-dynamic-prey problem, and Figure 7 shows the difference between the learning processes of the two methods, where "Q" denotes the result of conventional QL and "M" depicts the change of exploration cost (steps) during learning (episodes) with the proposed method.
Table 5. Parameters used in the simulation of the 2-hunter-1-dynamic-prey problem.

Parameter | Symbol | QL | Proposed method
Learning rate | α | 0.9 | 0.9
Damping constant | γ | 0.9 | 0.9
Temperature (initial value) | T | 0.99 | 0.99
Reward of prey captured by 2 hunters | r_{1} | 10.0 | 10.0
Reward of prey captured by 1 hunter | r_{2} | 1.0 | 1.0
Reward of one step movement | r_{3} | −0.1 | −0.1
Reward of wall crash | r_{4} | −1.0 | −1.0
Coefficient of Emo | L | - | 0.5
Coefficient of Pleasure | Pv | - | 10.0
Initial value of Arousal | Av | - | 1.0
Modification of Arousal | Δav | - | 0.1
Constant of Gaussian function | σ | - | 8.0
Comparison of the change of exploration costs (the number of steps from start to final states) during learning process in 2hunter1dynamicprey problem simulation (average of 10 simulations).
Figure 8 shows the trajectories of the two hunters pursuing the dynamic prey, which moved in the left or right direction (the broken lines). The trajectories are the solutions after 200 training episodes by the different methods: Figure 8(a) shows conventional QL and Figure 8(b) the proposed method. The length of each trajectory was the same: 13 steps (the theoretical optimal solution is 11 steps).
Comparison of different tracks of two hunters capturing a dynamic prey. (a) Results of Q learning (QL); (b) Results of the proposed method.
Additionally, the average exploration steps during learning for different numbers of hunters in the dynamic prey problem are summarized in Table 6. The proposed method showed higher efficiency than the conventional QL, and its effectiveness was more obvious than in the static prey problems (see also Figure 9).
Table 6. Average exploration steps during learning in dynamic prey problem simulations (200 episodes, 10 trials).

Number of hunters | Q learning (QL) | Proposed model (DEM)
2 | 202.7 | 24.4
3 | 65.1 | 13.3
4 | 96.0 | 15.3
Comparison of average exploration steps during learning in dynamic prey problem simulations (200 episodes, 10 trials).
4. Discussions
As the results of the pursuit problem simulations show, the proposed learning method, which combines QL with affect factors, enhanced the learning efficiency compared to the conventional QL. The parameters used in all methods were values optimized by experiments. For example, different values of the balance coefficient L in Equation (7) may yield different learning performance. L = 0.5 was used both in the static prey simulation (Table 3) and in the dynamic prey simulation (Table 5) because this value yielded the shortest path in each simulation. In Figure 10, a case of a simulation with three hunters and one dynamic prey is shown. The value of L was increased from 0 to 1.0 in increments of 0.1 along the horizontal axis, and the lowest point on the vertical axis shows that the average length (steps) of the capture process over 10 simulations was 13.11 steps at L = 0.8, so L = 0.8 was used as the optimal parameter value in this case.
Comparison of the change of exploration costs (the number of steps from start to final states) for different Q-M balance coefficients L in Equation (7) during the learning process in the 3-hunter-1-dynamic-prey problem simulation (average of 10 simulations).
It is interesting to investigate how the Arousal av and Pleasure pv values changed during the exploration and learning process in the pursuit problem simulation. Figure 11 depicts the change of these affect factors in the 2-hunter-1-dynamic-prey simulation. In Figure 11(a), the Arousal values of Hunters 1 and 2 changed together in the first three-step period and then separated. This suggests that the state of the local environment observed by Hunter 1 changed more dramatically than the situation of Hunter 2, so the Arousal av of Hunter 1 dropped over the exploration steps. This is the result of Equation (4), the definition of Arousal av. In contrast, in Figure 11(b), the Pleasure pv of Hunter 1 rose steeply to high values from step 2, corresponding to the dramatic change of the observed state: it might have found the prey and moved straight towards it. From the 9th step, the Pleasure value of Hunter 2 also rose to high values, through finding the prey or perceiving the high Pleasure value of Hunter 1.
From this analysis of the change of the affect factors, it can be judged that the proposed method worked efficiently in the reinforcement learning process, resulting in the improvement of the learning performance of the MAS.
The change of the Arousal av and Pleasure pv values of the two hunters during the exploration in the dynamic prey problem simulation. (a) The change of Arousal av; (b) The change of Pleasure pv.
5. Conclusions and Future Works
To improve the learning performance of reinforcement learning for multi-agent systems (MASs), a novel Q-learning (QL) method was proposed in this paper. The main idea of the improved QL is the adoption of affect factors of agents, which construct a "situation-oriented emotion", and a motivation function, which is the combination of the conventional state-action value function and the emotion function.
Compared with the conventional QL, the effectiveness of the proposed method was confirmed by simulation results of pursuit problems with static and dynamic preys, in the sense of learning costs and convergence properties.
The fundamental standpoint of this study, to use affective and emotional factors in the learning process, is in agreement with Greenberg [24]: "Emotion moves us and reason guides us". Conventional QL may only pay attention to "reasoning" from "the thinking brain" [24], or the basal ganglia [2]. Meanwhile, in the proposed method we also considered the role of "the emotional brain" [23], the amygdala or limbic system [24].
Therefore, in future works, function approximators such as neuro-fuzzy networks [9,10,11,12,13,14], more emotion factors such as fear, anger, etc. [31,32,37,38], and other behavioral psychological views [39] may be added to the proposed method. All these function modules may contribute to a higher performance of autonomous agents, and it will be interesting to apply these agents to the development of intelligent robots.
Acknowledgments
A part of this study was supported by the Foundation for the Fusion of Science and Technology (FOST) 2012-2014 and a Grant-in-Aid for Scientific Research (No. 23500181) from JSPS, Japan.
Conflict of Interest
The authors declare no conflict of interest.
References

1. Sutton, R.S.; Barto, A.G.
2. Doya, K. Metalearning and neuromodulation.
3. Asada, M.; Uchibe, E.; Hosoda, K. Cooperative behavior acquisition for mobile robots in dynamically changing real worlds via vision-based reinforcement learning and development.
4. Kollar, T.; Roy, N. Trajectory optimization using reinforcement learning for map exploration.
5. Jouffe, L. Fuzzy inference system learning by reinforcement learning.
6. Obayashi, M.; Nakahara, N.; Kuremoto, T.; Kobayashi, K. A robust reinforcement learning using concept of slide mode control.
7. Kuremoto, T.; Obayashi, M.; Yamamoto, A.; Kobayashi, K. Predicting chaotic time series by reinforcement learning.
8. Kuremoto, T.; Obayashi, M.; Kobayashi, K. Nonlinear prediction by reinforcement learning.
9. Kuremoto, T.; Obayashi, M.; Kobayashi, K. Forecasting time series by SOFNN with reinforcement learning.
10. Kuremoto, T.; Obayashi, M.; Kobayashi, K. Neural forecasting systems.
11. Kuremoto, T.; Obayashi, M.; Kobayashi, K.; Adachi, H.; Yoneda, K. A reinforcement learning system for swarm behaviors.
12. Kuremoto, T.; Obayashi, M.; Kobayashi, K. Swarm behavior acquisition by a neuro-fuzzy system and reinforcement learning algorithm.
13. Kuremoto, T.; Obayashi, M.; Kobayashi, K.; Adachi, H.; Yoneda, K. A neuro-fuzzy learning system for adaptive swarm behaviors dealing with continuous state space.
14. Kuremoto, T.; Obayashi, M.; Kobayashi, K. An improved internal model for swarm formation and adaptive swarm behavior acquisition.
15. Sycara, K.P. Multiagent systems.
16. Mataric, J. Reinforcement learning in multi-robot domain.
17. Makar, R.; Mahadevan, S. Hierarchical multi-agent reinforcement learning.
18. Kobayashi, K.; Kurano, T.; Kuremoto, T.; Obayashi, M. Cooperative behavior acquisition using attention degree.
19. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems.
20. Sutton, R.S. Learning to predict by the method of temporal difference.
21. Watkins, C.; Dayan, P. Technical note: Q-learning.
22. Konda, V.R.; Tsitsiklis, J.N. Actor-critic algorithms.
23. LeDoux, J.E.
24. Greenberg, L. Emotion and cognition in psychotherapy: The transforming power of affect.
25. Sato, S.; Nozawa, A.; Ide, H. Characteristics of behavior of robots with emotion model.
26. Kusano, T.; Nozawa, A.; Ide, H. Emergent of burden sharing of robots with emotion model (in Japanese).
27. Larsen, R.J.; Diener, E. Promises and problems with the circumplex model of emotion.
28. Russell, J.A. A circumplex model of affect.
29. Kuremoto, T.; Obayashi, M.; Kobayashi, K.; Feng, L.B. Autonomic behaviors of swarm robots driven by emotion and curiosity.
30. Kuremoto, T.; Obayashi, M.; Kobayashi, K.; Feng, L.B. An improved internal model of autonomous robot by a psychological approach.
31. Russell, J.A.; Feldman Barrett, L. Core affect, prototypical emotional episodes, and other things called emotion: Dissecting the elephant.
32. Russell, J.A. Core affect and the psychological construction of emotion.
33. Wundt, W.
34. Ortony, A.; Clore, G.; Collins, A.
35. Jaakkola, T.; Singh, S.P.; Jordan, M.I. Reinforcement learning algorithm for partially observable Markov decision problems.
36. Agogino, A.K.; Tumer, K. Quicker Q-learning in multi-agent systems. Available online: http://archive.org/details/nasa_techdoc_20050182925 (accessed on 30 May 2013).
37. Augustine, A.A.; Hemenover, S.H.; Larsen, R.J.; Shulman, T.E. Composition and consistency of the desired affective state: The role of personality and motivation.
38. Watanabe, S.; Obayashi, M.; Kuremoto, T.; Kobayashi, K. A new decision-making system of an agent based on emotional models in multi-agent system.
39. Aleksander, I. Designing conscious systems.