Multi-Agent Collaborative Target Search Based on the Multi-Agent Deep Deterministic Policy Gradient with Emotional Intrinsic Motivation

Abstract: Multi-agent collaborative target search is one of the main challenges in the multi-agent field, and deep reinforcement learning (DRL) is a good way to learn such a task. However, DRL often faces the problem of sparse rewards, which reduces its efficiency in task learning. Introducing intrinsic motivation has proved to be a useful way to mitigate the sparse reward problem in DRL. Therefore, based on the multi-agent deep deterministic policy gradient (MADDPG) structure, a new MADDPG algorithm with emotional intrinsic motivation, named MADDPG-E, is proposed in this paper for multi-agent collaborative target search. In MADDPG-E, a new emotional intrinsic motivation module with three emotions (joy, sadness, and fear) is designed. The three emotions are defined by mapping the corresponding psychological knowledge to the embodied situations of the agents in an environment. An emotional steady-state variable function H is then designed to help judge the goodness of the emotions. Based on H, an emotion-based intrinsic reward function is finally proposed. With the designed emotional intrinsic motivation module, the multi-agent system always tries to make itself joyful, which means it always learns to search for the target. To show the effectiveness of the proposed MADDPG-E algorithm, two kinds of simulation experiments, with a fixed initial position and a random initial position, respectively, are carried out, and comparisons are performed with MADDPG as well as MADDPG-ICM (MADDPG with an intrinsic curiosity module). The results show that with the designed emotional intrinsic motivation module, MADDPG-E has a higher learning speed and better learning stability, and the advantage is more obvious when facing complex situations.


Introduction
A multi-agent system is composed of multiple agents. Through communication, cooperation, or competition between agents, the multi-agent system can complete a large number of complex tasks that cannot be completed by a single agent [1]. Multi-agent collaborative control, with its advantages of high efficiency, high fault tolerance, and inherent parallelism [2], has been widely used in formation [3], unmanned systems [4], network resource allocation [5], multi-robot cooperative motion planning [6], target search [7], and other fields.
Among the applications described above, target search has attracted wide attention because of its wide application scenarios and practicability. Most of the early studies focused on single-agent target search in a static environment, and the search methods were usually random search or rule-based search, whose efficiency dropped greatly when faced with dynamic environments and dynamic targets. In recent years, target search has become an application direction of swarm intelligence technology, which can be used for search and rescue, environmental detection, warehouse handling, target roundup, etc. [8]. In these application scenarios, a single agent can no longer search for the target well, so researchers have gradually turned their attention to multi-agent collaborative target search. A wrong strategy reduces the efficiency of the agent search, so the cooperative strategy between agents is the key to multi-agent collaborative target search. Hazra et al. [9] introduced the Shapley function and fuzzy Shapley function to facilitate target search in a two-dimensional region with time constraints by minimizing the mission time and fuel usage. Cooper et al. [10] presented a method for analyzing the upper bound for time to find a target under the potential field guidance algorithm assuming a radially expanding search area. Tang et al. [11] proposed an adaptive robotic bat algorithm (ARBA) for multirobot target searching in unknown environments, in which the obstacle avoidance problem and the mechanism of jumping out of the local optimum are considered. A common idea for designing the cooperative strategy between agents is to transform the problem into a dynamic programming problem and to learn the optimal strategy of the agents with methods such as Lion Swarm Optimization (LSO) [12], the greedy algorithm [13], and model predictive control [14]. Traditional methods usually require complex mathematical calculations to solve multi-agent cooperative target search problems and are prone to falling into a local optimum. With the great success of deep learning in the field of artificial intelligence, more and more researchers use deep reinforcement learning (DRL) to find the optimal search strategy.
Deep reinforcement learning has been a hot topic in multi-agent target search in recent years [7,15-18]. DRL combines the perception ability of deep learning and the decision-making ability of reinforcement learning [19], and it provides a solution for the perception and decision-making problems of complex multi-agent systems. In 2017, Tampuu et al. [20] first extended the Deep Q-Learning framework to a multi-agent environment in which two learning agents play the Atari Pong game. The result indicates that DQN can be extended to the learning of multi-agent systems. Lowe et al. [21] proposed the multi-agent deep deterministic policy gradient (MADDPG), which effectively addressed the problem of a non-stationary environment and achieved good results in various environments, such as cooperative, competitive, and mixed ones. In 2018, a new multi-agent policy gradient algorithm [22] was proposed, which solved the high-variance gradient estimation problem and could be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents. Compared with other algorithms, the MADDPG algorithm can be applied to multiple task scenarios such as competition and cooperation between multiple agents. Meanwhile, it can use the observation information from other agents for centralized training, so as to improve the efficiency of the algorithm. Deep reinforcement learning adopts an end-to-end strategy, which is more targeted than traditional methods, but under the sparse rewards of multi-agent target search scenarios, the algorithm stability is still poor.
In the multi-agent target search process based on deep reinforcement learning, however, directly using sparse reward samples can cause neural network training to diverge or fail to improve the strategy. A straightforward approach to address this problem is to use artificially designed dense rewards. However, such a method has certain limitations; for example, the agent's strategy easily converges to a local optimum, which has a negative impact on the agent's learning [23]. Another way to handle sparse rewards is to add goals, uncertainty measures, or intrinsic motivation to the exploration of deep reinforcement learning. Compared with adding goals and uncertainty measures, the deep reinforcement learning method based on intrinsic motivation [7,24-26] formalizes a variety of heuristic concepts derived from cognitive psychology into intrinsic reward signals to drive the agent to independently and efficiently explore the environment. In addition, the internal reward system and motivation are believed to differentiate an intelligent being from an unintelligent one [27]. Intrinsic motivation can be combined with deep reinforcement learning methods based on value functions or policy gradients [7,15] to form a strong heuristic exploration strategy. As early as 2004, Barto et al. [28] studied the use of sophisticated reinforcement learning techniques on a simple novelty-based intrinsic motivation system. Inspired by their work, and considering that the critic in reinforcement learning can be part of the agent itself [30], Oudeyer et al. [29] presented an intrinsic motivation system named Intelligent Adaptive Curiosity (IAC), which tended to push a robot toward situations in which it maximized its learning progress, and pointed out that any existing reinforcement learning technique could be associated with the IAC drive. Pathak et al. [31] proposed an intrinsic curiosity module (ICM) and formulated curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.
Another important intrinsic motivation is emotion. From the perspective of the embodiment view of the mind, it is assumed that cognition is situated [32], and as an important factor of cognition, emotion is also embodied. A mental representation of emotions emerges from the interaction of the agent's body state with its awareness of the stimulus from the environment in which that state is observed [33]. Furthermore, it is believed that emotional responses are characterized by changes in the body state, e.g., behavior [34]. During the agents' sensorimotor learning in tasks, they not only observe the current state of the environment but also experience changes in their emotions, which further affect the learning process. Researchers in the fields of neuroscience and psychology have also demonstrated that emotion is an important part of decision making [35], and both positive and negative emotions can lead to changes in learning motivation. Feldmaier et al. [36] proposed a framework to incorporate an emotional model into the decision-making process of a machine learning agent and used a hierarchical structure to combine reinforcement learning with a dimensional emotional model. Fang et al. [37] proposed a pursuit task allocation algorithm based on emotional contagion to study the interaction between affective robots in multi-agent cooperative systems. Guzzi et al. [38] proposed a model for adaptation and implicit coordination in multi-robot systems based on the definition of artificial emotions. Emotions are defined according to the robot's situation and are classified as neutral, fear, frustration, urgency, and confusion. Achiam et al. [39] used the surprise emotion as intrinsic motivation and enabled the agent to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards. Loyola et al. [25] used the boredom emotion as an intrinsic motivation to generate routes in scenarios where rewards were absent and to facilitate robot navigation toward the goals. However, in most of these works, emotional intrinsic motivation is applied to a single-agent learning task and is rarely applied to multi-agent collaborative target search.
In this work, considering the multi-agent collaborative target search problem and building on the MADDPG algorithm [21], an emotion intrinsic motivation module is first designed to provide intrinsic rewards, and then a new MADDPG algorithm with emotion intrinsic motivation, named MADDPG-E, is proposed. In the emotion intrinsic motivation module, three emotions are introduced and defined according to the multi-agent target search task: joy, sadness, and fear. An emotional steady-state variable function is designed to judge the goodness of the emotions, and it determines the final value of the proposed emotion-based intrinsic reward function. With the introduced emotion intrinsic motivation module, it is hoped that the multi-agent system will always lean toward positive emotions, so as to speed up task learning and improve the learning performance.
The remainder of this paper is structured as follows. Section 2 introduces some background information. The proposed algorithm as well as the emotional intrinsic motivation design are introduced in Section 3. The simulation experiment results and discussions are presented in Section 4. Section 5 concludes this paper and envisages some future work.

Background
During multi-agent training, the state of each agent changes, and the environment is non-stationary from the point of view of the other agents. Therefore, traditional reinforcement learning methods, such as Q-Learning or policy gradient, are not suitable for multi-agent environments. For this reason, Lowe et al. [21] extended DDPG, a single-agent actor-critic method of deep reinforcement learning, to MADDPG and made it suitable for the multi-agent environment. Its algorithm framework is shown in Figure 1. The core of the MADDPG algorithm is centralized training with decentralized execution: the Critic of each agent can obtain the action information of all the other agents. This means that, during training, the Critic can observe the whole situation and guide the training of the Actor. When testing, the algorithm only uses the Actor network with local observations to take actions. In the MADDPG algorithm, the policy set of all the agents is $\pi = \{\pi_1, \ldots, \pi_N\}$ with parameters $\theta = \{\theta_1, \ldots, \theta_N\}$, and the gradient of the expected return of agent $i$ is, following [21],

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim p^{\mu},\, a_i \sim \pi_i}\left[\nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\, Q_i^{\pi}(x, a_1, \ldots, a_N)\right],$$

where $p^{\mu}$ is the state distribution. $Q_i^{\pi}(x, a_1, \ldots, a_N)$ is a centralized action-value function; it takes the actions $(a_1, \ldots, a_N)$ of all the agents and the observations of all the agents $x = (o_1, \ldots, o_N)$ as the input, and the output is the Q value of agent $i$. An agent in a given state may have different actions to choose from; according to the Q value calculated by the Critic network, the agent prefers the action with the highest Q value so as to obtain the maximum reward. The policy gradient in MADDPG increases the selection probability of actions with a high Q value and decreases the selection probability of actions with a low Q value.
Lowe et al. then extended this result to $N$ deterministic continuous policies $\mu_{\theta_i}$ (abbreviated as $\mu_i$). The gradient is written as

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D}\left[\nabla_{\theta_i} \mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\right],$$

where $D$ is the experience replay buffer, which records the experiences of all the agents, and $o_i$ is the observed state of agent $i$. The centralized action-value function $Q_i^{\mu}$ is updated by minimizing

$$L(\theta_i) = \mathbb{E}_{x, a, r, x'}\left[\left(Q_i^{\mu}(x, a_1, \ldots, a_N) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i^{\mu'}(x', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(o_j)},$$

where $\mu' = \{\mu_{\theta_1'}, \ldots, \mu_{\theta_N'}\}$ is the set of target policies with delayed parameters $\theta_i'$, and $\gamma$ is the discount factor.
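As a rough illustration of how the centralized critic target in MADDPG can be formed, the following Python sketch computes the TD target y = r_i + γ Q'_i(x', a'_1, ..., a'_N), with the next actions produced by the target policies, and a mean squared TD error over a batch. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the names `td_target` and `critic_loss`, the scalar observations, and the toy callables standing in for the target Actor and Critic networks are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical stand-ins for neural networks: plain Python callables.
Policy = Callable[[float], float]                          # o_j -> a_j
Critic = Callable[[Sequence[float], Sequence[float]], float]  # (x, a) -> Q


def td_target(reward: float, gamma: float, next_obs: Sequence[float],
              target_policies: Sequence[Policy], target_critic: Critic) -> float:
    """Return y = r_i + gamma * Q'_i(x', a'_1, ..., a'_N).

    Each next action a'_j comes from the j-th target policy applied to the
    j-th agent's own next observation (decentralized actors), while the
    target critic sees all observations and all actions (centralized critic).
    """
    next_actions = [mu(o) for mu, o in zip(target_policies, next_obs)]
    return reward + gamma * target_critic(next_obs, next_actions)


def critic_loss(q_values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared TD error (Q - y)^2 over a batch of samples."""
    if not q_values or len(q_values) != len(targets):
        raise ValueError("q_values and targets must be non-empty and equal in length")
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Toy usage with two agents and made-up linear "networks".
toy_policies = [lambda o: 2.0 * o, lambda o: -o]
toy_critic = lambda x, a: sum(x) + sum(a)
y = td_target(reward=1.0, gamma=0.9, next_obs=[1.0, 2.0],
              target_policies=toy_policies, target_critic=toy_critic)
```

In a real implementation the callables would be neural networks and the loss would be minimized by gradient descent on the critic parameters; the sketch only shows which quantities enter the target.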
The MADDPG algorithm has the following three characteristics:
• The optimal policy obtained by learning only needs to use the local information to take the optimal action.
• Neither a model of the environment nor special communication requirements are needed.
• The algorithm can be used not only in a cooperative environment but also in a competitive environment.

The MADDPG-E Algorithm Framework
The basic idea of MADDPG is its framework of centralized training with decentralized execution. That is, during the training process, the Critic network of each agent collects the state and action information of all the agents, but during the execution phase, decisions are made only by each agent's Actor network based on the agent's own local actions and states. Due to problems such as a sparse reward and a non-stationary environment, the agent is not motivated to explore, resulting in a low reward and insufficient model convergence. Therefore, the MADDPG algorithm with emotional intrinsic motivation, named MADDPG-E, is proposed in this paper, and its algorithm framework is shown in Figure 2. In MADDPG-E, a new emotional intrinsic motivation module is designed. The emotional intrinsic motivation reward at each time step is generated according to the environmental stimuli and cognitive states, and this intrinsic reward, together with the environmental reward of the agent, is used as the overall reward of the agent's search process. In this way, not only can the problem of reward sparsity be alleviated, but collisions can also be better avoided.
The algorithm proposed in this paper consists of six elements: $S$, $O$, $A$, $r_e(t)$, $R$, $M$. The specific meaning of each element is as follows:
• $S = \{s_k \mid k = 1, 2, \ldots, m\}$ is the state space, $s_k$ represents the $k$-th state of the agent, and $m$ represents the number of states of the agent. All agents share the same state space.
• $O = \{O_1, O_2, \ldots, O_N\}$ is the set of the multi-agent observation spaces, and $N$ is the number of agents. In each episode, every agent observes the state of the environment through perception, and the agent can obtain the position of obstacles, its own position, the position of other agents, and the position of the target within its detectable range.
• $A = \{A_1, A_2, \ldots, A_N\}$ is the set of the multi-agent action spaces, and $A_N$ is the action of agent $N$, which is mainly related to speed and direction. The action of agent $N$ at time $t + 1$ is expressed in terms of $\alpha(t + 1)$, the movement angle of the agent at time $t + 1$; $\beta$, the change rate of the agent's motion angle; $v(t + 1)$, the movement speed of the agent at time $t + 1$; and the acceleration.
• $r_e(t) = \{r_1(t), r_2(t), r_3(t), r_4(t)\}$ is the set of environmental rewards. All agents share the same environmental reward set. The agent is rewarded for moving to the target location and punished for colliding with the obstacles or bounds. A dynamic penalty function is set for the collision between agents, which helps prevent unsafe states. With these environmental rewards, agents can learn to move in the direction with the largest reward value and adopt the search strategy with the largest cumulative reward, which helps them search for the target faster. At each time step, the agent changes its state and receives a reward from the environment. The environmental reward function $r_e(t)$ in this paper is designed as follows:
  $r_1(t) = 10$, if the agent has found the target.
  $r_2(t) = -2$, if the agent collides with an obstacle.
  $r_3(t)$ is a dynamic penalty applied if an agent collides with another agent; it depends on the collision penalty factor $\lambda$ and on the positions $(x_i(t), y_i(t))$ of the agents at time $t$.
  $r_4(t) = -10 \times (\max(|x_i(t)|, |y_i(t)|) - 0.9)$, if $\max(|x_i(t)|, |y_i(t)|) \geq 0.9$, which represents the agent colliding with the bounds.
• $R(t)$ is the average value of the rewards obtained by the $N$ agents at time $t$ and is the overall reward of the multi-agent system in the collaborative target search process. It is computed from the following quantities: $(x_{tar}(t), y_{tar}(t))$ is the location of the target at time $t$, $r^{em}_i(t)$ is the emotional intrinsic motivation reward of agent $i$, and $r^{e}_i(t)$ is the environmental reward of agent $i$ at time $t$.
• $M$ stands for the memory module, which stores the collected experience in an experience replay array. Each entry is a quadruplet $\{s(t), a(t), s(t + 1), R(t)\}$, where $o_N(t)$ is the observed state of agent $N$ at time $t$ and $a_N(t)$ is the action chosen by agent $N$ at time $t$.
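The environmental reward terms listed above can be sketched in Python as follows. The values 10, -2, and the boundary term -10 × (max(|x|, |y|) - 0.9) are taken from the text; everything else is an assumption made for illustration. In particular, the form of the agent-agent penalty r_3 is not reproduced here, so it is passed in as a caller-supplied number, and adding the four terms into one scalar is an assumed way of combining them. The function names are hypothetical.

```python
def boundary_penalty(x: float, y: float, limit: float = 0.9) -> float:
    """r_4: -10 * (max(|x|, |y|) - limit) once the agent reaches the bound, else 0."""
    overshoot = max(abs(x), abs(y)) - limit
    return -10.0 * overshoot if overshoot >= 0.0 else 0.0


def environmental_reward(found_target: bool, hit_obstacle: bool,
                         x: float, y: float,
                         agent_collision_penalty: float = 0.0) -> float:
    """Combine the four environmental reward terms for one agent at one step.

    agent_collision_penalty stands in for r_3; its functional form is not
    given here, so the caller must supply the value (assumption).
    Summing the terms is likewise an assumption of this sketch.
    """
    reward = 0.0
    if found_target:
        reward += 10.0          # r_1: the agent has found the target
    if hit_obstacle:
        reward += -2.0          # r_2: collision with an obstacle
    reward += agent_collision_penalty   # r_3: placeholder supplied by caller
    reward += boundary_penalty(x, y)    # r_4: collision with the bounds
    return reward


# Example: an agent at (0.95, 0.2) that has crossed the 0.9 bound.
example = environmental_reward(found_target=False, hit_obstacle=False, x=0.95, y=0.2)
```

Because the boundary term grows linearly with the overshoot, an agent that drifts further past the bound receives a proportionally larger penalty rather than a fixed one.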
In our algorithm, for the $N$ agents in the system, each agent $i$ has an Actor network with parameters $\theta^{\mu}_i$ and a Critic network. The Actor network is deterministic: for given inputs $(S_i, O_i)$, the output action $a_i$ is deterministic. The input of the Critic network is the global state and the actions of all the agents, and the output is a real number that indicates how good it is to perform action $a$ in state $s$. The Critic networks are used to evaluate all the actions and guide the Actor networks to make improvements.
The role of the Actor network is to increase the average value of the Critic network by improving the parameters $\theta^{\mu}_i$ through training. The gradient of the expected return for agent $i$ follows the deterministic policy gradient of MADDPG [21], where $Q_i$ corresponds to the centralized critic of agent $i$. Its input consists of the joint observation $o = (o_1, o_2, \ldots, o_N)$ as well as the chosen actions $a_1, a_2, \ldots, a_N$ of all the agents. The role of the Critic network is to conduct centralized training on the joint observation and action, whereas the policy input consists solely of the individual observation $o_i$ used to choose action $a_i$. The centralized critics for the deterministic action policies are optimized with respect to a TD-error loss: the $i$-th value network is updated with the TD error so that it better fits the value function $Q(s, a)$.
In the update of the Critic network, $\mu_i$ corresponds to the deterministic policy of agent $i$, and the target value is computed by the critic of agent $i$ with delayed parameters.
Both the Critic target network and the Actor target network use the soft update method for their parameters, that is, $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ with a small update rate $\tau$. Our MADDPG-E algorithm training process takes a centralized training and decentralized execution approach. That is, each agent obtains the action to perform in the current state according to its own strategy and interacts with the environment, and the resulting experience is stored in its own memory module. After all the agents have interacted with the environment, each agent randomly draws experience from the experience pool to train its own neural networks so as to output the optimal action at each moment. To speed up the learning process of an agent, the input to the Critic network includes the observed states and the actions taken by the other agents, and the Critic network parameters are updated by minimizing the loss. The parameters of the Actor network are then updated by the gradient method.
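The soft update of the target networks mentioned above is commonly written as θ' ← τθ + (1 - τ)θ'. The short Python sketch below applies that rule to flat lists of parameters. It is illustrative only: the update rate `tau`, the function name `soft_update`, and the representation of network weights as plain lists of floats are assumptions, not details taken from this paper.

```python
from typing import List, Sequence


def soft_update(target_params: Sequence[float],
                online_params: Sequence[float],
                tau: float) -> List[float]:
    """Return new target parameters: tau * online + (1 - tau) * target.

    A small tau makes the target network track the online network slowly,
    which keeps the TD target from changing abruptly between updates.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Example: the target moves 10% of the way toward the online parameters.
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], tau=0.1)
```

With `tau = 1.0` the rule degenerates to a hard copy of the online parameters, and with `tau = 0.0` the target never changes, so practical values sit close to zero.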

Emotional Intrinsic Motivation Module
Emotion has the function of driving behavioral adaptation, and emotion changes can generate learning changes [40]. At present, the choice of emotion mainly focuses on the six basic emotions proposed by Ekman: anger, disgust, fear, joy, sadness, and surprise [41]. Each basic emotion can be considered as an elementary response pattern, or action tendency [42]. Among them, in the robot emotional system, fear mainly serves to avoid danger, joy mainly corresponds to positive reinforcement behavior, and sadness is dominated by negative reinforcement behavior.
In the MADDPG-E algorithm, three agent states are defined in the process of multi-agent target search: in state 1, the agents search the target; in state 2, the agents do not search the target; and in state 3, the agents are in danger, such as a collision. It is hoped that the agents reach state 1 more often, which means the agents can search for the target more often, and at the same time, fewer collisions are expected. Therefore, in MADDPG-E, three emotional motivations related to the three states defined above are introduced: joy, sadness, and fear, as shown in Table 1. Among the three emotional motivations, joy is a positive emotional motivation, and sadness and fear are negative emotional motivations. Both positive and negative emotional motivations generate the corresponding learning motivation and then affect the action choice of the agents. Positive emotional motivation makes the agent move toward the target, while negative emotional motivation prevents the agent from reaching bad states, such as states 2 and 3. Thus, the convergence process of the MADDPG-E algorithm is accelerated.

Table 1. States and emotional motivations.

State                                            Emotional Motivation
The agents search the target                     Joy
The agents do not search the target              Sadness
The agents are in danger, such as a collision    Fear

Aiming at the multi-agent search process with three emotional motivations, an emotional steady-state variable function $H$ is defined, in which $K^{joy}_m$, $K^{sad}_m$, and $K^{fear}_m$ are the numbers of agents reaching state 1, state 2, and state 3, and $r_{joy}$, $r_{sad}$, and $r_{fear}$ represent the rewards that the agents obtain from the environment when they reach state 1, state 2, and state 3, respectively.
In two adjacent learning episodes, the difference in the value of the agent's emotional steady-state variable function leads to emotional changes in the agent. Inspired by the one-dimensional emotional model, we consider only one emotional dimension and define the emotional function $E$ from this difference. Emotional intrinsic motivation affects the learning efficiency of the agent because it is an indirect mapping of the information in the learning process of the agent. Therefore, an emotion-based intrinsic reward function is defined, in which $T$ is the maximum time episode and $C$ is the emotion coefficient; $C$ in turn depends on the emotional motivation reward parameter $k$, whose value lies in the range $[0, 1]$. Emotional intrinsic motivation can generate different emotions in the agents and act as an intrinsic reward according to the environmental stimulus and cognitive state, which together with the environmental reward serves as the agent's overall reward. In our algorithm, it can be seen from formula (8) that adding emotional intrinsic motivation rewards can better evaluate deterministic strategic actions and guide the agent's action selection. The emotional function $E$ encourages the agent to move in the direction that generates positive emotion.

Simulation Experiment
The multi-agent collaborative target search problem refers to multiple agents cooperating with each other and avoiding collisions and obstacles to search for targets. Agents can communicate with each other to avoid collisions by sharing location information. In order to verify the effectiveness of our MADDPG-E algorithm, we used the Multi-agent Particle Environment (MPE) [43] provided by OpenAI. The multi-agent cooperative search scenario is shown in Figure 3. In the two-dimensional plane, the red balls represent the three agents, whose speed is relatively low. The green ball represents the target to be searched for, whose speed is relatively high. Because the target moves faster, it is difficult for a single agent to search for it, so multiple agents need to cooperate to fulfill the task. The black balls represent the two obstacles. The multi-agent collaborative target search task is for the agents to avoid collisions and search for the target. Therefore, the distance between the agents, the distance between the agents and obstacles, and the distance between the agents and the target are used as evaluation indicators to measure the multi-agent search strategy. The multi-agent search strategy evaluation index is defined in terms of the following quantities: $D_i$, $D_n$, $D_o$, and $D_t$ are the central coordinates of agent $i$, the $n$-th agent, the obstacle, and the target, respectively, and $R_i$, $R_n$, $R_o$, and $R_t$ are the radii of agent $i$, the $n$-th agent, the obstacle, and the target, respectively. Through this quality evaluation index, the status of the agent in the current position and the quality of the current search strategy can be judged. The hyperparameters in the experiment are shown in Table 2. Two experiments are designed in this paper: the fixed initial position experiment and the random initial position experiment.
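One plausible way to turn central coordinates and radii into a distance-based indicator is to measure the gap between the surfaces of two balls. The Python sketch below does this; it is an assumed reading of the evaluation index, not the formula used in this paper, and the function names `surface_gap` and `is_collision` are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def surface_gap(center_a: Point, radius_a: float,
                center_b: Point, radius_b: float) -> float:
    """Distance between the two centers minus the sum of the radii.

    A positive value means the two balls are apart; zero or a negative
    value means they touch or overlap.
    """
    return math.dist(center_a, center_b) - (radius_a + radius_b)


def is_collision(center_a: Point, radius_a: float,
                 center_b: Point, radius_b: float) -> bool:
    """Treat touching or overlapping balls as a collision (assumption)."""
    return surface_gap(center_a, radius_a, center_b, radius_b) <= 0.0


# Example: an agent at the origin and an obstacle at (0.3, 0.4), both of radius 0.1.
gap = surface_gap((0.0, 0.0), 0.1, (0.3, 0.4), 0.1)
```

The same helper can be applied to agent-agent, agent-obstacle, and agent-target pairs, which matches the three kinds of distances named in the text.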

Experiment 1: Fixed Initial Position
The fixed initial position experiment means that every episode of training starts with the same positions of the agents, the obstacles, and the target. To show that the proposed MADDPG-E algorithm can help the multi-agent system achieve the objective in different situations, two fixed initial position experiments with different initial positions (as shown in Table 3) are performed. Figure 4a shows the multi-agent target search process under initial positions 1, and Figure 4b shows the results with initial positions 2. The figure shows that the agents find the target at 29 and 32 steps, respectively. It can be seen that the agents can learn good cooperative search strategies through training. As can be seen from the search process, when facing a faster-moving target, the agents can use obstacles, boundaries, and other environmental factors to form an encircling strategy for the target so as to find it. Two indicators are usually used to evaluate different deep reinforcement learning models: the convergence speed and the reward value after convergence. To show the effectiveness of the emotional intrinsic motivation module in MADDPG-E, we compare its performance with that of the MADDPG algorithm and the MADDPG-ICM algorithm (MADDPG with curiosity intrinsic motivation). The average reward is shown in Figure 5. First, compared with MADDPG (blue line), both MADDPG-E (green line) and MADDPG-ICM (red line) finally converge to a significantly higher reward value. The larger the reward value after convergence, the more times the agents find the target, which demonstrates the effectiveness of the intrinsic motivation. Second, although the two algorithms reach a similar reward value, MADDPG-E begins to converge at about 6000 episodes, while MADDPG-ICM begins to converge at about 7500 episodes. This indicates that MADDPG-E converges faster than MADDPG-ICM, that is, MADDPG-E has a faster learning speed and the emotional intrinsic motivation module is useful. Figure 6 shows the mean reward values and standard deviations of the multi-agent system in 6000 episodes of training. The standard deviation of the MADDPG-E algorithm is the smallest, and a smaller standard deviation means better stability of the algorithm model. This means that MADDPG-E has better learning stability than the other two algorithms. In addition, in order to test the performance of the algorithm models during the learning process, a score evaluation index is defined here. In each episode of training, the agents get five points if they find the target and no points if they fail to find it. The average scores are shown in Figure 7. The convergence score value of the MADDPG algorithm is significantly lower than that of the other two algorithms, which means that, over the same number of learning episodes, it finds the target fewer times. The multi-agent system is then trained for longer, and Table 4 gives more details, including the average reward, average score, and target search time of the three algorithms in 30,000 episodes. The results show that the MADDPG-E algorithm has a shorter average target search time and more successful target searches than the other two algorithms. This shows that the algorithm proposed in this paper has advantages in both the search speed and the number of successful searches.
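The score index described above (five points for an episode in which the target is found, zero otherwise) and the mean and standard deviation of the reward can be computed as in the following Python sketch. The function names are hypothetical, the sample data are made up for illustration, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def episode_score(found_target: bool) -> int:
    """Five points if the agents found the target in the episode, else zero."""
    return 5 if found_target else 0


def average_score(outcomes: Sequence[bool]) -> float:
    """Average score over a sequence of per-episode success flags."""
    return mean(episode_score(found) for found in outcomes)


def reward_mean_std(rewards: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-episode rewards."""
    return mean(rewards), pstdev(rewards)


# Made-up example: the target is found in three of four episodes.
avg = average_score([True, False, True, True])
m, s = reward_mean_std([1.0, 3.0])
```

Since the per-episode score is either 0 or 5, the average score divided by five is simply the fraction of episodes in which the target was found.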

Experiment 2: Random Initial Position
To show the robustness of the proposed MADDPG-E algorithm, random initial position experiments are performed, which means that for every episode of training, the positions of the agents, the obstacles, and the target are randomly generated. Figure 8 shows three groups of the multi-agent target search process. It can be seen that although the system starts to learn from different situations, it can still search for the target, and the agents learn a good collaborative target search strategy, which indicates the strong learning ability of the MADDPG-E algorithm.
Figure 9 shows the average reward during the training process. It can be seen that in the early stage (before about 6000 episodes of training), the reward curves of all three algorithms fluctuate heavily because of excessive exploration. However, the MADDPG-E algorithm begins to converge after about 10,000 episodes of training, which is faster than the other two algorithms. Because MADDPG-E adds emotional intrinsic motivation, it can increase the reward that the multi-agent system obtains and can accelerate the learning speed as well as the convergence speed. Moreover, compared with MADDPG-ICM and MADDPG, the reward curve of MADDPG-E is smoother, which means its learning result is more stable and oscillates less. Figure 10 shows the mean reward value and standard deviation of the system after 5000 episodes of training in experiment 2. After convergence, the average standard deviation of the reward of the MADDPG-E algorithm is smaller than that of the other two algorithms, indicating that the MADDPG-E model has better stability. The curves of the average score of the three algorithms in experiment 2 are shown in Figure 11. The MADDPG-E algorithm has a significantly higher average score than the other two algorithms, and the agents successfully find the target more times. Moreover, in the early stage of exploration, MADDPG-E is more stable. Table 5 shows the values of the agent's average reward, average score, and average target search time in 30,000 episodes of training. Compared with the other two algorithms, the MADDPG-E algorithm finds the target in a shorter time and more often, indicating that it is more efficient in task learning. Compared with the experimental results of the fixed initial position, the MADDPG-E algorithm proposed in this paper has more obvious advantages over the other two algorithms in the complex random initial position experiments, in terms of the reward value after convergence as well as the convergence speed and stability, which indicates that the proposed MADDPG-E algorithm can better adapt to complex and unknown situations.

Conclusions
In this paper, an improved MADDPG algorithm is proposed to address the sparse reward problem in multi-agent collaborative target search. Under the framework of the MADDPG algorithm, a new module of emotional intrinsic motivation is added, and three kinds of emotional motivation (joy, sadness, and fear) are introduced. The emotional intrinsic motivation module can generate corresponding intrinsic rewards according to different states of the multi-agent collaborative target search process to accelerate the learning speed as well as optimize the learning process. Two kinds of simulation experiments are then carried out in this paper, and the results show that the proposed MADDPG-E algorithm can learn a good search strategy and has higher search efficiency and better stability.
At present, the proposed MADDPG-E algorithm has a good effect on the multi-agent target search task. However, a limitation of this study is that we do not make role distinctions for agents, which can lead to increased training time, especially in some real-world complex scenarios. Therefore, in future work, we will distinguish the roles of the agents and improve our emotional intrinsic motivation module to make our method more suitable for complex environments and improve the success rate of the target search.

In Equation (10), $m$ represents the $m$-th learning episode of the agent. $H_m$ is the steady-state variable function of the emotional motivation change inside the agent at learning episode $m$, and $H^{change}_m$ represents the emotional motivation change value within the agent in learning episode $m$, which can be calculated by the following formula:

$$H^{change}_m = \phi K^{joy}_m r_{joy} + \delta K^{fear}_m r_{fear} + \omega K^{sad}_m r_{sad} \tag{11}$$

$\phi$, $\delta$, and $\omega$ are the weight parameters of the internal changes in the three emotional motivations of joy, fear, and sadness, respectively, which determine the degree of influence of the external environmental information on internal emotional changes, and their values are in the range $[0, 1]$.
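Reading Equation (11) as a weighted sum of three terms, one per emotion, each being the number of agents that reached the corresponding state multiplied by the reward for that state, the emotional motivation change value can be sketched in Python as follows. The function name, the argument names, and all numeric values in the example are hypothetical and chosen only for illustration; they are not the parameters used in this paper.

```python
def emotional_change(k_joy: int, k_fear: int, k_sad: int,
                     r_joy: float, r_fear: float, r_sad: float,
                     phi: float, delta: float, omega: float) -> float:
    """Weighted sum phi*K_joy*r_joy + delta*K_fear*r_fear + omega*K_sad*r_sad.

    k_* are the numbers of agents that reached the joy, fear, and sadness
    states in the episode; phi, delta, omega are the weights for joy, fear,
    and sadness and must lie in [0, 1].
    """
    for name, weight in (("phi", phi), ("delta", delta), ("omega", omega)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return (phi * k_joy * r_joy
            + delta * k_fear * r_fear
            + omega * k_sad * r_sad)


# Illustrative values only: two agents in the joy state, one in the fear state.
h_change = emotional_change(k_joy=2, k_fear=1, k_sad=0,
                            r_joy=10.0, r_fear=-2.0, r_sad=-1.0,
                            phi=0.5, delta=0.5, omega=0.5)
```

With positive rewards for the joy state and negative rewards for the other two, the value increases when more agents find the target and decreases when more agents collide or fail to find it.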

Figure 3. The multi-agent collaborative target search scenario.

Figure 4. Multi-agent target search experiment with fixed initial positions: (a) fixed initial positions experiment 1; (b) fixed initial positions experiment 2.

Figure 6. Average reward value and standard deviation in experiment 1.

Figure 8. Multi-agent target search experiment with randomized initial positions: (a) randomized initial position experiment 1; (b) randomized initial position experiment 2; (c) randomized initial position experiment 3.

Figure 10. Average reward value and standard deviation in experiment 2.

Table 2. The training parameters.

Table 3. Initial position for the agents, targets, and obstacles.

Table 4. Average reward value, average score, and average target search time in 30,000 episodes.

Table 5. Average reward value, average score, and average target search time in 30,000 episodes.