Algorithms
  • Article
  • Open Access

22 October 2025

Learning System-Optimal and Individual-Optimal Collision Avoidance Behaviors by Autonomous Mobile Agents

1 Graduate School of Maritime Sciences, Kobe University, Kobe 658-0022, Hyogo, Japan
2 Division of Navigation Science, Mokpo National Maritime University, Mokpo-si 58628, Jeollanam-do, Republic of Korea
* Author to whom correspondence should be addressed.
Current address: Japan Airlines Co., Ltd., Tokyo 140-8637, Japan.
This article belongs to the Special Issue Multi-Objective and Multi-Level Optimization: Algorithms and Applications (2nd Edition)

Abstract

Automated collision avoidance is a central topic in multi-agent systems that consist of mobile agents. One simple approach to pursuing system-wide performance is a centralized algorithm, which, however, becomes computationally expensive when a large number of agents are involved. Fully distributed collision avoidance algorithms that can naturally handle many-to-many encounter situations have therefore been proposed. The DSSA+ is one such algorithm; it is heuristic and incomplete but has lower communication and computation overheads than its counterparts. However, the DSSA+ and some other distributed collision avoidance algorithms optimize the agents' behavior only in the short term and do not account for the total efficiency of their paths. This may cause some agents' paths to deviate excessively or to stagnate at low speeds. In this paper, we present the Distributed Stochastic Search algorithm with a deep Q-network (DSSQ), in which the agents can generate time-efficient collision-free paths while they learn independently, by Deep Reinforcement Learning, whether to detour or change speeds. A key idea in the learning principle of the DSSQ is to let the agents pursue their individual optimality. We have experimentally confirmed that a sequence of short-term system-optimal solutions found by the DSSA+ gradually becomes long-term individually optimal for every agent.

1. Introduction

Automated collision avoidance is a central topic in multi-agent systems that consist of mobile agents, such as robots [1], vessels [2], vehicles [3], and aerial robots [4], and has been well studied over the past decades. To realize this technology in the real world, we need to address various issues, including the design and implementation of basic algorithms, the kinematic constraints and maneuverability of mobile agents, and environmental dynamics. In this paper, we focus on the design of basic algorithms that are independent of specific mobile agents, and address the most important principle in this design: the balance between safety and time efficiency.
One popular approach to collision avoidance is the use of one-to-many algorithms, which are designed for a single agent that encounters multiple other agents in its vicinity. One-to-many algorithms are further classified into inference-based algorithms [1,5,6,7,8] and learning-based algorithms [9,10,11]. One drawback of these one-to-many algorithms is that they do not consider system-wide performance. More specifically, they ignore the fact that the decisions made by the agents naturally interact with each other. Consequently, if every agent simultaneously performed a one-to-many algorithm, the system-wide performance would likely become unstable.
The field of multi-agent path finding has tried to overcome this drawback by using centralized algorithms, which compute all the paths of the agents at a central computational server [12,13]. Although system optimality is ensured by using an exact algorithm at the server, a centralized algorithm generally becomes computationally expensive, especially when involving a large number of agents.
On the other hand, in order to remove the necessity of a central computational server and also to distribute its computational cost over the agents themselves, many studies have introduced distributed algorithms or many-to-many algorithms by assuming a communication network system among mobile agents, such as AIS or VDES [14,15,16,17,18,19,20,21,22,23,24].
Among these distributed collision avoidance algorithms, this paper focuses on the DSSA+ [21], which iteratively applies an incomplete but simple heuristic algorithm for the Distributed Constraint Optimization Problem (DCOP) [25] to compute a collision-free and efficient course and speed for each agent. Because the DSSA+ is essentially a heuristic search algorithm that directly finds sub-optimal courses and speeds using a simple cost function, it is expected to operate more agilely than ADMM-based distributed collision avoidance algorithms [17,18,19,20,23,24], which rely on consensus among neighboring agents on the values of state variables [26]. However, in the DSSA+ and most other distributed collision avoidance algorithms, some agents may end up deviating excessively from their shortest courses or stagnating at a minimum speed when avoiding collisions in very complicated situations. These behaviors are caused by the fact that the DSSA+ generally optimizes the agents' decisions only in the short term, without regard for long-term efficiency. In other words, it forces every agent to follow a short-term system-optimal solution, that is, a course and speed that are optimal from the system-wide point of view but not necessarily from the individual agent's point of view. In the DSSA+, such gaps between system optimality and individual optimality accumulate monotonically over iterations and deteriorate the long-term efficiency of some agents' paths. Furthermore, such long-term deviation from individual optimality may cause unexpected unfairness among agents in terms of their efficiency.
To overcome this issue, we present a novel framework that combines the DSSA+ with Deep Reinforcement Learning (DRL) [27]. We introduce a new algorithm called the Distributed Stochastic Search algorithm with a deep Q-network (DSSQ), in which the agents in the DSSA+ can generate time-efficient collision-free paths while they learn independently whether to detour or change speeds depending on the situation, and overcome the unfairness in long-term efficiency among themselves through their individual learning. A key idea in the learning principle of this framework is to let the agents pursue their individual optimality. In this paper, we show experimentally that a sequence of short-term system-optimal solutions found by the DSSA+ gradually becomes long-term individually optimal for every agent through DRL.
At the end of this section, we think it is worthwhile to present the collision avoidance example that motivated this work: a performance called “Synchronized Walking” [28], in which each player is expected to learn both system-optimal and individual-optimal collision avoidance behavior through repeated training. We consider this learning process to be one way to realize “intelligent” collision avoidance, and we would like to explore its mechanism by a constructive approach using an abstract model that ignores physical details.
The rest of this paper is organized as follows. We first provide an overview of recent studies on collision avoidance and then describe the DSSA+ and DRL as necessary backgrounds. Next, we detail how to integrate the DSSA+ and DRL, followed by experiments to see how the performance of the DSSA+ is improved by incorporating DRL. Finally, we conclude this work and give some future directions.

2. Related Work

One popular approach to automated collision avoidance is the use of one-to-many algorithms, which are designed for a single agent that encounters multiple other agents in its vicinity. These algorithms are further classified into inference-based algorithms and learning-based algorithms. Some inference-based algorithms use the velocity obstacle (VO) of other mobile agents obtained by observation. The ORCA [8] and its variants [1,5] are typical VO-based algorithms, where an agent navigates itself among mobile agents by solving a linear programming problem. Other inference-based algorithms exploit the future trajectories of other mobile agents, which are predicted by the agent itself [6,7]. Recently, several studies [9,10,11,29] have proposed learning-based one-to-many algorithms, where an agent performs DRL to obtain an optimal collision avoidance policy through experience. One drawback of these one-to-many algorithms is that they focus only on the view of one specific agent and do not consider system-wide performance.
To deal with system-wide issues in collision avoidance, communication among agents, either explicit or implicit, is required. Centralized algorithms, which compute all the paths of the agents at a central computational server, also belong to this category since all information is transmitted to the server. In a discrete domain, where a problem is formulated on a graph, multi-agent path finding (MAPF) algorithms [12] can provide collision-free optimal paths when all information is collected. In a continuous domain, there is a study in which a central planner computes the trajectories for all agents using a geometric algorithm [13]. However, given that a central computational server with system-wide perfect information is needed, and that the computational cost on the server is very high, we believe that centralized algorithms do not suit environments such as collision avoidance, where quick decisions must be made.
Distributed, or many-to-many, algorithms have thus been proposed. C-Nav [30] introduces implicit coordination, where agents in a crowded environment cooperate with their neighbors by taking their intended velocities into account. A distributed negotiation scheme for generating trajectories in a cooperative manner has been proposed in [31].
Many distributed collision avoidance algorithms have been proposed, mainly for maritime transportation, assuming the use of inter-vessel communication network systems such as AIS or VDES. In particular, in recent years there has been active research into ADMM-based distributed collision avoidance built on the augmented Lagrangian relaxation method, and several algorithms have been proposed that take ship motion into account and are theoretically guaranteed to converge to a stationary point [17,18,19,20,23,24]. In addition, for the autonomous vehicle platoon control problem, several studies have been conducted under the framework of distributed model predictive control and, more recently, its multilayer extension [32,33,34,35].
On the other hand, as an approach different from ADMM, research has been conducted on general-purpose distributed collision avoidance algorithms based on the discrete and simple DCOP model. Several complete DCOP algorithms, such as SyncBB [36], DPOP [37], and AFB [38], were recently tested in this setting [22]. Furthermore, a heuristic DCOP algorithm called the DSA [25] provides a foundation for efficient and agile distributed collision avoidance algorithms, namely the DSSA [16] and the DSSA+ [21]. An explicit DCOP formulation has also been used to deal with collision avoidance issues in the distributed target coverage problem [39]. The DSSA and DSSA+ can operate more agilely than ADMM-based distributed collision avoidance algorithms because the models they solve are essentially unconstrained nonlinear discrete optimization problems, as described in the next section, whereas the models solved by ADMM-based algorithms are essentially constrained nonlinear continuous optimization problems. We thus focus on the DSSA+ in this paper.

3. Distributed Stochastic Search Algorithm

3.1. Framework

The framework of our distributed collision avoidance assumes that every agent alternates between the control phase and the search phase at each discrete time step t. In the control phase, if there are no other agents in its detection range, an agent that has not yet reached its destination moves immediately to its next intended position. On the other hand, if there are other agents, called neighbors, in its detection range, the agent and its neighbors go into the search phase, where they jointly try to find appropriate actions to avoid collisions by running a distributed collision avoidance algorithm.
The DSSA+ [21] is one such distributed collision avoidance algorithm. Every agent i in the DSSA+ can change its heading by Δθ ∈ {−45, −40, …, +45} ∪ {θ_{d_i}^t − θ_i^t} and its speed by Δv ∈ {−8, −6, …, +8}; a pair of values (Δθ, Δv) is called an intention, and the possible intentions are depicted by dots in Figure 1. For example, the large red dot in the center of Figure 1 indicates the intention (Δθ = 0, Δv = 0), which means maintaining the current course and speed. The red dot immediately to its left indicates the intention (Δθ = −5, Δv = 0), which means changing course by 5 degrees to the left while maintaining the current speed. The dot just below the large red dot indicates the intention (Δθ = 0, Δv = −2), which means maintaining the current course but reducing speed by two units. Note that p_{d_i} is the destination of agent i, and θ_{d_i}^t is the absolute angle that corresponds to the direction toward p_{d_i}. The size of the set of possible intentions (Dom) is 180 (20 course changes × 9 speed changes).
Figure 1. Possible intentions ( D o m ). The large red dot in the center indicates an intention of ( Δ θ = 0 , Δ v = 0 ) , which means to maintain the current course and speed. Other red dots also indicate other possible intentions. Note that p d i is the destination of agent i, and θ d i t is the absolute angle that corresponds to the direction toward p d i .
On the other hand, the original DSSA [16] assumes that a mobile agent cannot change its speed, but can only change its course, i.e.,  Δ v is always zero and the intention is only Δ θ , not ( Δ θ , Δ v ) . In other words, the DSSA is functionally subsumed by the DSSA+.
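To make the size of Dom concrete, the following minimal Python sketch (with hypothetical helper and variable names; not the authors' implementation) enumerates the 180 intentions from the 20 candidate course changes and 9 candidate speed changes described above:

def build_dom(theta_current: float, theta_dest: float) -> list:
    """Enumerate all (delta_theta, delta_v) intentions of one agent."""
    course_changes = list(range(-45, 50, 5))             # -45, -40, ..., +45 (19 values)
    course_changes.append(theta_dest - theta_current)    # extra candidate: head straight to the destination
    speed_changes = list(range(-8, 10, 2))               # -8, -6, ..., +8 (9 values)
    return [(d_theta, d_v) for d_theta in course_changes for d_v in speed_changes]

dom = build_dom(theta_current=90.0, theta_dest=72.5)
assert len(dom) == 180                                   # 20 course changes x 9 speed changes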

3.2. DSSA+

The pseudocode of the DSSA+ is shown in Algorithm 1. At round 0, agent i initializes intention_i to (0, 0), which indicates keeping the current course and speed (line 1). Then agent i exchanges its current position, current heading θ_i^t, current speed v_i^t, and intention_i with its current neighbors Neighbors_i^t (line 3). Having received all of the neighbors' information, agent i computes Cost_i(Δθ, Δv) by the following formula for every possible intention (line 4):
$Cost_i(\Delta\theta, \Delta v) \triangleq \sum_{j \in Neighbors_i^t} CR_i(\Delta\theta, \Delta v, j) + EF_i(\Delta\theta, \Delta v)$,  (1)
where
$CR_i(\Delta\theta, \Delta v, j) \triangleq \begin{cases} \mathit{TimeWindow} - \mathit{TCPA}(\Delta\theta, \Delta v, j) & \text{if } i \text{ will collide with } j, \\ 0 & \text{otherwise}, \end{cases}$  (2)
$EF_i(\Delta\theta, \Delta v) \triangleq \alpha \cdot \dfrac{|(\theta_i^t + \Delta\theta) - \theta_{d_i}^t|}{\pi} + \beta \cdot \dfrac{|\min\{\max\{v_i^t + \Delta v,\, v_i^{min}\},\, v_i^{max}\} - v_i^{ref}|}{v_i^{ref}}$.  (3)
CR_i(Δθ, Δv, j) in Equation (2) computes the collision risk against neighboring agent j when i changes its current course by Δθ and its current speed by Δv. TimeWindow is a constant number of time steps over which neighbors' future positions are predicted. TCPA(Δθ, Δv, j) returns the time to the closest point of approach with j when i changes its current course by Δθ and its current speed by Δv, while agent j is assumed to follow its received intention. It can be calculated simply, based on Euclidean geometry, from j's information. Hence, if agents i and j may collide sometime within TimeWindow, the shorter the time until that collision, the larger the value of CR_i(Δθ, Δv, j) becomes. These collision risks are summed over all of the neighbors in computing the first term of Cost_i(Δθ, Δv).
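As a reference for how such a geometric computation might look, here is a minimal Python sketch of the closest-point-of-approach quantities under a constant-velocity assumption (the function names and the handling of parallel motion are our own illustrative choices, not the authors' implementation):

import math

def tcpa(p_i, v_i, p_j, v_j):
    """Time to the closest point of approach between agents i and j,
    assuming both keep their current velocity vectors."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]       # relative position
    dvx, dvy = v_j[0] - v_i[0], v_j[1] - v_i[1]     # relative velocity
    rel_speed_sq = dvx * dvx + dvy * dvy
    if rel_speed_sq < 1e-9:                         # parallel motion: the distance never changes
        return math.inf
    t = -(dx * dvx + dy * dvy) / rel_speed_sq       # minimizes |relative position + t * relative velocity|
    return max(t, 0.0)                              # a CPA in the past means the agents are already separating

def dcpa(p_i, v_i, p_j, v_j):
    """Distance at the closest point of approach (used to judge whether i will collide with j)."""
    t = tcpa(p_i, v_i, p_j, v_j)
    if math.isinf(t):
        t = 0.0                                     # relative distance is constant
    cx = (p_j[0] + v_j[0] * t) - (p_i[0] + v_i[0] * t)
    cy = (p_j[1] + v_j[1] * t) - (p_i[1] + v_i[1] * t)
    return math.hypot(cx, cy)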
E F i ( Δ θ , Δ v ) in Equation (3) computes inefficiency when agent i changes its current course by Δ θ and current speed by Δ v . It consists of two terms, where the first term computes inefficiency on the course and the second term computes inefficiency on the speed.
The first term of E F i ( Δ θ , Δ v ) will be zero when the next course, which is the sum of the current course θ i t and course change Δ θ , is exactly the same as the direction towards agent i’s destination, namely θ d i t . Similarly, the second term will be zero when the next speed, which is the sum of the current speed v i t and speed change Δ v , is exactly the same as the reference speed v i r e f . The reference speed is the most desirable one for agent i to follow consistently. Note that v i m a x and v i m i n are the maximum and minimum speed of agent i, respectively.
These two terms are weighted by the factors α and β, respectively, and summed to compute the inefficiency in Equation (3). Interestingly, when we set α > β, agent i prefers to change its speed rather than its course to avoid collisions, and when we set α < β, agent i prefers to change its course rather than its speed. In the original DSSA+, the values of these weighting factors were fixed throughout algorithm execution.
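The following minimal Python sketch of Equation (3) (our own illustration; the conversion of the angular deviation to radians before dividing by π is an assumption) shows how α and β trade course changes off against speed changes:

import math

def inefficiency(d_theta, d_v, theta, theta_dest, v, v_ref, v_min, v_max, alpha, beta):
    """EF_i of Equation (3): weighted deviation from the ideal course and speed."""
    course_term = abs(math.radians((theta + d_theta) - theta_dest)) / math.pi
    next_speed = min(max(v + d_v, v_min), v_max)     # clip the next speed to the feasible range
    speed_term = abs(next_speed - v_ref) / v_ref
    return alpha * course_term + beta * speed_term

# With alpha < beta, course deviations are penalized less than speed deviations,
# so a small detour is cheaper than slowing down, and vice versa.
ef_detour = inefficiency(d_theta=15, d_v=0, theta=0, theta_dest=0,
                         v=25, v_ref=25, v_min=1, v_max=25, alpha=0.1, beta=0.9)
ef_slow = inefficiency(d_theta=0, d_v=-4, theta=0, theta_dest=0,
                       v=25, v_ref=25, v_min=1, v_max=25, alpha=0.1, beta=0.9)
assert ef_detour < ef_slow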
By using Equation (1), the agents try to decide their own intentions by exchanging their tentative intentions with their neighbors at each round. When receiving its neighbors' tentative intentions, agent i tries to select the values of Δθ and Δv that realize the maximum cost reduction. Cost_i(intention_i) indicates the cost of the current intention, and new_intention_i is the pair of course and speed changes that minimizes the cost (line 5). Thus improvement_i is the maximum cost reduction for i (line 6). When improvement_i > 0, agent i tries to renew its tentative intention intention_i by a stochastic walk, which makes i change intention_i to new_intention_i with probability p and keep the tentative intention with probability 1 − p (lines 7 to 9).
Agent i continues to perform this process over the space of intention_i every round, aiming at a state where all of the agents are satisfied with their current intentions. This iteration terminates when all of the agents reach quiescence or the computation time exceeds a time limit (lines 10 to 14).
After that, the DSSA+ finds ( Δ θ , Δ v ) , which are more likely to be system-wide optimal course and speed changes for each agent i. By using them, the next course and speed of agent i will be updated, and i will move to the next position (line 16).
We would like to clarify the objective function for system optimality that has been mentioned repeatedly so far: it is the following global objective function that the DSSA+ attempts to minimize:
$\sum_{i \in Agents} Cost_i(\Delta\theta_i, \Delta v_i)$,  (4)
where A g e n t s is a set of agents that participate in the DSSA+, and both Δ θ i and Δ v i are decision variables that take the intention of agent i as their values. Note that although the solution obtained by the DSSA+ may be optimal for this global objective function, it is not necessarily optimal for each individual agent, which typically wants to proceed straight to its destination at its reference speed.
Algorithm 1 DSSA+
Input: current position p_i^t, current speed v_i^t, current heading θ_i^t, a set of current neighboring agents Neighbors_i^t, cost function Cost_i(·), threshold p on stochastic walk
Output: heading θ_i^{t+1}, speed v_i^{t+1}
 1: round ← 0, intention_i ← (0, 0)
 2: while True do
 3:     Exchange information with Neighbors_i^t
 4:     Calculate Cost_i(·) for every (Δθ, Δv)
 5:     new_intention_i ← argmin_{(Δθ, Δv)} Cost_i(Δθ, Δv)
 6:     improvement_i ← Cost_i(intention_i) − Cost_i(new_intention_i)
 7:     if improvement_i > 0 and Rand(·) < p then
 8:         intention_i ← new_intention_i
 9:     end if
10:     if exceed time limit or i and ∀j ∈ Neighbors_i^t are satisfied with their own intentions then
11:         break
12:     else
13:         round ← round + 1
14:     end if
15: end while
16: (θ_i^{t+1}, v_i^{t+1}) ← (θ_i^t, v_i^t) + intention_i
17: return θ_i^{t+1}, v_i^{t+1}
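For intuition, the inner decision of Algorithm 1 (lines 5 to 9) can be sketched in Python as follows; the cost table is assumed to have been computed by Equation (1) after exchanging intentions with the neighbors, and the function names are ours, not the authors':

import random

def dssa_plus_round(cost, intention, p=0.8):
    """One local round of the DSSA+: greedy proposal followed by a stochastic walk.

    cost      -- dict mapping each (d_theta, d_v) intention to its Cost_i value
    intention -- the agent's current tentative intention
    p         -- probability of adopting an improving intention (line 7)
    """
    new_intention = min(cost, key=cost.get)               # line 5: argmin of Cost_i
    improvement = cost[intention] - cost[new_intention]   # line 6
    if improvement > 0 and random.random() < p:           # lines 7-8: stochastic walk
        return new_intention
    return intention                                      # keep the tentative intention

In the full algorithm, this step is repeated every round after re-exchanging tentative intentions with the neighbors, until all agents are satisfied with their intentions or the time limit expires.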

4. Deep Reinforcement Learning

4.1. Q-Learning

Reinforcement Learning (RL) [40] is one of the machine learning frameworks for solving a sequential decision-making process, where an agent aims to maximize the total sum of rewards given by the environment. RL is based on the Markov decision process (MDP), denoted by a tuple M = ⟨S, A, T, r, γ⟩, where S is a state space that indicates the set of all possible environmental states, A is an action space that indicates the set of all possible actions, T is the environmental state transition probability, r is a reward function, and γ ∈ [0, 1] is a discount factor. An agent observes state s_t of the environment at some time step t and performs action a_t, following its policy π(a|s). Then, at the next time step t + 1, it observes a new state s_{t+1} while obtaining a reward r_{t+1}. The return G_t at time step t is usually defined as the discounted total sum of rewards, namely $G_t \triangleq \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1}$. The goal of RL is to find the optimal policy π* that maximizes the total sum of rewards from t = 0 to t = T.
In value-based RL algorithms, the optimal policy π* can be learned indirectly by estimating the value function V^π(s_t), which is the expected return obtained by starting from state s_t and following policy π, namely V^π(s_t) = E_π[G_t | S_t = s_t]. Instead of V^π(s_t), we can consider the Q-function Q^π(s_t, a_t), which is the expected return obtained by taking action a_t at state s_t and following policy π afterwards. According to the Bellman equation, the optimal Q-function Q*(s_t, a_t), which indicates the maximum expected return obtained by taking action a_t at state s_t and following the optimal policy π* afterwards, satisfies Q*(s_t, a_t) = E[r_{t+1} + γ · max_{a'} Q*(s_{t+1}, a')] for any state–action pair. Q-learning [41] is the algorithm for estimating the optimal Q-function by sequential value iteration through the actual interaction between an agent and the environment. More specifically, the optimal Q-value is obtained for any state–action pair by minimizing the Temporal Difference error (TD error), which is computed as r_{t+1} + γ · max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t). Since this estimation does not depend on the behavioral policy of the agent, the optimal Q-function can be learned sample-efficiently.
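A minimal tabular sketch of this update rule (the learning rate and the dummy states are our own illustrative assumptions) is shown below:

from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, lr=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    td_error = td_target - Q[(s, a)]           # the TD error described in the text
    Q[(s, a)] += lr * td_error
    return td_error

Q = defaultdict(float)                          # unseen state-action pairs default to 0.0
q_learning_update(Q, s="s0", a="keep", r=-0.5, s_next="s1", actions=["keep", "detour", "slow"])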

4.2. Deep Q Network

When the state and action spaces are discrete and finite, all possible Q-values can be stored in a simple table in principle. However, when these spaces become huge, such a table cannot be maintained in practice. To overcome this difficulty, the deep Q-network (DQN) [27] has been proposed as the very first Deep Reinforcement Learning algorithm, where a Deep Neural Network called a DQN is used to approximate the optimal Q-function.
In the DQN, an agent stores its experience as a tuple ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ in its replay buffer D, and the DQN parameterized by w is trained by Experience Replay, which updates the DQN with a batch of experiences randomly sampled from D in order to break the temporal correlation between the data of time steps t and t + 1 that would otherwise disturb the learning of the DQN. Basically, the DQN is updated to minimize the following loss function:
$L(w) \triangleq \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim D}\left[\left(r_{t+1} + \gamma \cdot \max_{a'} Q(s_{t+1}, a'; w^{-}) - Q(s_t, a_t; w)\right)^{2}\right]$,  (5)
where w⁻ indicates the parameters of the target DQN, which maintains the target Q-value for any state–action pair. Minimizing L(w) is equivalent to minimizing the TD error and thus to learning the optimal Q-function Q*(s_t, a_t) = E_{⟨s_t, a_t, r_{t+1}, s_{t+1}⟩ ∼ D}[r_{t+1} + γ · max_{a'} Q*(s_{t+1}, a')].
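A minimal PyTorch sketch of this loss (the terminal-state masking and the batch layout are our own assumptions; the paper's additional techniques such as Double DQN and prioritized replay are omitted) could look like this:

import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error of Equation (5) on a minibatch sampled from the replay buffer.

    batch -- (states, actions, rewards, next_states, dones) tensors.
    """
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; w)
    with torch.no_grad():                                             # the target uses the frozen parameters w^-
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return nn.functional.mse_loss(q_sa, target)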

5. Approach

5.1. Basic Idea

Agent i's path from its origin p_{o_i} to its destination p_{d_i} is constructed as a sequence of optimal solutions, each obtained by the DSSA+. However, the paths of some agents, formed as a series of system-wide optimal solutions sequentially produced by the DSSA+, may turn out to be inefficient, e.g., tortuous trajectories or repeated accelerations and decelerations, while the paths of other agents in the same scenario may happen to be efficient in the long term. This is because the DSSA+ optimizes agent i's action only reactively, ignoring its individual long-term optimality. Consequently, the short-term nature of optimization by the DSSA+ may cause a high system-wide variance in the agents' total time steps.
To deal with this problem, we first introduce the notion of loss, which measures the "distance" between the solution (θ_i^t, v_i^t) obtained for agent i after running the DSSA+ and an ideal solution (θ_{d_i}^t, v_i^{ref}) for agent i. The ideal solution for agent i is the pair of the course θ_{d_i}^t that leads i directly to its destination p_{d_i} and its own reference speed v_i^{ref}. We define the loss of agent i at time step t as follows:
$loss_i^t \triangleq \dfrac{|\theta_i^t - \theta_{d_i}^t|}{C_1} + \dfrac{|v_i^t - v_i^{ref}|}{C_2}$,  (6)
which is a simple linear combination of the absolute angular difference between θ_i^t and θ_{d_i}^t and the absolute difference between the speeds v_i^t and v_i^{ref}. C_1 and C_2 are positive constants that equalize the ranges of these two terms.
The inefficiency of the path of agent i from p o i to p d i that is obtained as a sequence of DSSA+ solutions could be measured as
$Loss_i^t \triangleq \sum_{t'=t}^{T} loss_i^{t'}$,  (7)
where T is the time step at which agent i reaches its destination.
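As a concrete illustration of Equations (6) and (7), here is a minimal Python sketch (the constants follow the experimental settings C_1 = 45 and C_2 = 8 reported later; the angle handling is simplified and does not wrap around ±180 degrees):

def step_loss(theta, theta_dest, v, v_ref, c1=45.0, c2=8.0):
    """loss_i^t of Equation (6): normalized deviation from the ideal course and speed."""
    return abs(theta - theta_dest) / c1 + abs(v - v_ref) / c2

def path_loss(trajectory):
    """Loss_i^t of Equation (7): per-step losses summed until the destination is reached.

    trajectory -- list of (theta, theta_dest, v, v_ref) tuples, one per time step.
    """
    return sum(step_loss(*step) for step in trajectory)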
The objective of this paper is to build an algorithm in which agent i pursues not only cooperation with the other agents through the DSSA+ but also minimization of Loss_i^0 by itself, so that the obtained path satisfies long-term individual optimality. More specifically, we try to develop a rule for changing α and β in Equation (3) adaptively, depending on the situation at each time step. By changing α and β according to this rule, agent i decides at each time step whether to detour or to change speeds in order to avoid collisions efficiently, and consequently comes to generate a long-term efficient path.

5.2. Distributed Stochastic Search Algorithm with Deep Q-Network (DSSQ)

5.2.1. Overview of DSSQ

In this paper, we introduce a new algorithm called the Distributed Stochastic Search algorithm with a deep Q-network (DSSQ), in which we incorporate an extended version of a DQN into the DSSA+. The main reason we adopt a simple value-based algorithm is the sample efficiency of the DQN, through which we can expect an agent to obtain a better policy effectively in practice, even in a complex multi-agent situation. Although we acknowledge that learning in the DSSA+ may suffer from the partial observability of individual agents, we aim to make the best of the simplicity and efficiency of the DQN as a first step. The outline of the DSSQ is shown in Algorithm 2. A rough sketch of its technical ideas is as follows.
Algorithm 2 DSSQ
 1: Initialize DQN_i(·) and replay buffer D_i ← ∅
 2: for episode = 1, …, M do
 3:     Initialize position p_i^0 = p_{o_i}, speed v_i^0 = v_i^{ref}, heading θ_i^0 = θ_{d_i}^0
 4:     for time step t = 0, …, T do
 5:         if p_i^t = p_{d_i} then
 6:             break
 7:         else
 8:             Exchange information with Neighbors_i^t to observe state s_i^t
 9:             (α_i^t, β_i^t) ← DQN_i(s_i^t)
10:             Set α_i^t, β_i^t in Cost_i(·)
11:             (θ_i^{t+1}, v_i^{t+1}) ← DSSA+(·)
12:             Move to next position p_i^{t+1}
13:             Get reward r_i^{t+1} and observe next state s_i^{t+1}
14:             D_i ← D_i ∪ {⟨s_i^t, (α_i^t, β_i^t), r_i^{t+1}, s_i^{t+1}⟩}
15:         end if
16:     end for
17:     Make a minibatch from D_i
18:     Update DQN_i(·)
19: end for
We view the problem of minimizing the long-term inefficiency Loss_i^0 of agent i's path as a DRL problem. We simply set the reward given to agent i at time step t to r_i^t ≜ −loss_i^t, and aim to minimize Loss_i^t by getting each agent to maximize its total sum of rewards independently using DRL.
To enhance the performance of the DQN, we use some techniques like Double DQN [42], Prioritized Experience Replay [43], multi-step learning [40,44], and the dueling network [45]. These were also used in one of the state-of-the-art value-based DRL algorithms [46].

5.2.2. State Space

One advantage of using a Deep Neural Network in RL is the ability to use raw, high-dimensional observation data directly as state s_t. This enables an agent to learn both how to represent the environment and how to behave to achieve the task in that environment. If this idea were simply applied to the DSSQ, the raw data observed by agent i in its detection range would be a strong candidate for the state space. However, as shown in [46] and elsewhere, a huge number of training epochs are required to handle such raw data even in a single-agent problem. Therefore, in this paper, we adopt more abstract and lower-dimensional data extracted from the raw data as the state space.
As such an abstract and lower-dimensional state space, we use the cost distribution over the domain D o m of intentions in Figure 1 just before starting the DSSA+ at each time step t. In fact, this cost distribution contains adequate information about both collision risks and inefficiency computed based on current intentions of neighboring agents while abstracting the raw data within a detection range.
Note that the cost distribution over D o m varies as the DSSA+ at time step t proceeds, because neighboring agents constantly change their intentions during the execution of the DSSA+. Among such constantly varying cost distributions, we will adopt the very first one, which is the one just before starting the DSSA+, as a state. Considering an action, which is the values of α and β in the cost function, will be determined based on the state, we believe this is a reasonable choice.
As seen in Equation (1), the cost is actually the sum of collision risk and inefficiency, but it is more convenient to handle them separately in the state. To be more specific, we first compute the value of collision risk given by the first term of Equation (1) for every possible value of course change Δθ and speed change Δv, and serialize them to produce a vector CR_i^t ∈ R^180. Then, two scalar values, |θ_i^t − θ_{d_i}^t| and |v_i^t − v_i^{ref}|, which are the deviations from the ideal values of the current heading and speed used in computing the inefficiency in Equation (3), are added. Namely, we define the state of agent i at time step t by
$s_i^t \triangleq \mathbf{CR}_i^t \cup \{|\theta_i^t - \theta_{d_i}^t|,\, |v_i^t - v_i^{ref}|\}$.  (8)
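The following minimal Python sketch shows how such a state vector could be assembled (the 20 × 9 grid layout and variable names are assumptions consistent with |Dom| = 180, not the authors' code):

import numpy as np

def build_state(cr_grid, theta, theta_dest, v, v_ref):
    """Assemble s_i^t of Equation (8): the serialized collision-risk distribution over Dom
    plus the two scalar deviations used by the inefficiency term."""
    cr_vec = np.asarray(cr_grid, dtype=np.float32).reshape(-1)    # CR_i^t in R^180
    extras = np.array([abs(theta - theta_dest), abs(v - v_ref)], dtype=np.float32)
    return np.concatenate([cr_vec, extras])                       # 182-dimensional state

state = build_state(np.zeros((20, 9)), theta=30.0, theta_dest=25.0, v=20.0, v_ref=25.0)
assert state.shape == (182,)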

5.2.3. Action Space

Changing α and β in Equation (3) of the DSSA+ can be viewed as an action in the context of RL. Action a_i^t of agent i at time step t is selected from a predetermined finite set of value pairs for α and β, such as
$a_i^t \triangleq (\alpha_i^t, \beta_i^t) \in \{(0.1, 0.9),\, (0.5, 0.5),\, (0.9, 0.1)\}$.  (9)
We could have made each agent's action space even larger. However, because this learning problem, known as multi-agent learning, is inherently hard, we decided to keep each agent's action space as small as possible as a first step toward solving it in a purely distributed environment.
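In practice, the DQN's discrete output simply indexes into this set of weight pairs before the DSSA+ is run at each time step; a minimal sketch (with hypothetical names) follows:

import numpy as np

ACTIONS = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]   # candidate (alpha, beta) pairs of Equation (9)

def select_weights(q_values):
    """Greedily map the DQN's three Q-values to the (alpha, beta) pair
    plugged into Equation (3) before running the DSSA+ at this time step."""
    return ACTIONS[int(np.argmax(q_values))]

alpha, beta = select_weights(np.array([0.2, 1.3, -0.4]))   # -> (0.5, 0.5)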

5.2.4. Reward Function

We also define the reward function for agent i at time step t as follows:
$r_i^t \triangleq \begin{cases} -loss_i^t & \text{if } i \text{ did not collide with others}, \\ r_{col} & \text{otherwise}. \end{cases}$  (10)
When agent i settles on heading directly to its destination at its reference speed as a result of changing α and β in the DSSA+ at time step t, namely θ_i^{t+1} = θ_{d_i}^{t+1} and v_i^{t+1} = v_i^{ref}, the reward takes its maximum value. When agent i settles on some other heading and speed, the reward is reduced from that maximum. Since the DSSA+ is not complete, collisions may happen on rare occasions. The reward is overwritten by r_col (r_col ≤ 0) if a collision occurs.
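A minimal Python sketch of this reward (using the experimental settings C_1 = 45, C_2 = 8, and r_col = −10 reported in Section 6) is:

def reward(theta, theta_dest, v, v_ref, collided, r_col=-10.0, c1=45.0, c2=8.0):
    """r_i^t of Equation (10): the negative per-step loss, overwritten by r_col on a collision."""
    if collided:
        return r_col
    step_loss = abs(theta - theta_dest) / c1 + abs(v - v_ref) / c2
    return -step_loss      # maximal (zero) when heading straight to the destination at the reference speed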

5.2.5. DQN Architecture

The DQN with which agent i is equipped, denoted by DQN_i, takes s_i^t as its input and computes the Q-values of the possible actions as its output. The values of the (reshaped) CR_i^t ∈ R^180 part of state s_i^t are expected to have strong spatial correlation, since they are computed from the current intentions of neighboring agents (Figure 1). Thus, as shown in Figure 2, the CR_i^t part of the input layer is connected to a Convolutional Neural Network (CNN) that has two convolutional layers with a kernel size of 3 × 3 and one max-pooling layer with a kernel size of 2 × 2. The remaining two values {|θ_i^t − θ_{d_i}^t|, |v_i^t − v_i^{ref}|} ∈ R^2 of the input layer are concatenated with CR_i^t after it has been processed by the CNN and flattened. After this concatenation, they are connected to Fully Connected (FC) layers with 128 ReLUs per hidden layer. The depth of the FC layers may change depending on the experimental settings.
Figure 2. DQN architecture.
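Under these architectural hints, a minimal PyTorch sketch of DQN_i could look as follows; the channel counts, padding, input grid layout (20 × 9), and single hidden FC layer are our own assumptions, and the dueling head used in the paper is omitted for brevity:

import torch
import torch.nn as nn

class DSSQNet(nn.Module):
    """Sketch of the network in Figure 2: a small CNN over the collision-risk grid,
    concatenated with the two scalar deviations, followed by fully connected layers."""

    def __init__(self, n_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                       # (B, 32, 10, 4) for a 20 x 9 input
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 10 * 4 + 2, 128), nn.ReLU(),        # CNN features + 2 scalar deviations
            nn.Linear(128, n_actions),                         # Q-values for the (alpha, beta) pairs
        )

    def forward(self, cr_grid: torch.Tensor, extras: torch.Tensor) -> torch.Tensor:
        # cr_grid: (B, 1, 20, 9) collision risks, extras: (B, 2) course/speed deviations
        features = self.conv(cr_grid).flatten(start_dim=1)
        return self.fc(torch.cat([features, extras], dim=1))

q_values = DSSQNet()(torch.zeros(1, 1, 20, 9), torch.zeros(1, 2))   # shape (1, 3)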

6. Experiment

To evaluate the performance of the DSSQ, we conducted experiments using a discrete event simulator. In the simulator, each agent is represented by a circle with a diameter of 20, and can move freely in a two-dimensional continuous space of 800 × 800 with the upper left corner as the origin. The maximum and minimum speeds, v_i^{max} and v_i^{min}, of each agent i are set to 25 and 1, respectively. The detection range of each agent is a circle with a diameter of 500, and if there exist some other agents (neighbors) in that circle, the agent can communicate with them. The safety domain of each agent is a circle with a diameter of 40, and if one of its neighbors penetrates this circle, we consider that a collision has occurred. One episode in the DSSQ is the term from the time at which all agents start moving from their origins until the time at which all of them successfully reach their destinations, or some agents collide with each other on the way. TimeWindow is set to 20 time steps for computing the collision risk against each neighbor j by Equation (2). In other words, an agent can compute a collision risk considering up to 20 time steps ahead. Parameter p of the stochastic walk at line 7 in Algorithm 1 is set to 0.8. For C_1 and C_2 in Equation (6), we set C_1 to the maximum course change (45) and C_2 to the maximum speed change (8). We also set r_col in Equation (10), the penalty when agent i collides, to −10.0.
Recall that each agent i has its own deep Q-network D Q N i and replay buffer D i . In terms of learning, nothing is shared among agents. As mentioned before, the depth of FC layers changes depending on scenarios. It has one hidden layer for the scenarios of para2 and overtake3, which will be described below, but five hidden layers for the other scenarios. We use the ε -greedy policy with values of ε ranging from 0.9 to 0.0, decaying by a step of 0.1 for every 1000 episodes. Furthermore, we train the DQN by using mini-batch learning with a batch size of 32. The discount factor γ is set to 0.99. We should point out that although the values for hyper-parameters of the DSSA+ and the DQN can essentially be set arbitrarily, we have set empirically reasonable values through sufficient prior tuning.
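The exploration schedule just described (ε decaying from 0.9 to 0.0 in steps of 0.1 every 1000 episodes) can be written as a small Python sketch for reference:

import random

def epsilon(episode):
    """Exploration rate: 0.9 for episodes 0-999, 0.8 for 1000-1999, ..., 0.0 afterwards."""
    return max(0.9 - 0.1 * (episode // 1000), 0.0)

def epsilon_greedy(q_values, episode):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = len(q_values)
    if random.random() < epsilon(episode):
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])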
We used the following five scenarios to evaluate the performance of the DSSQ. They cover all three collision avoidance situations defined by COLREGs Rules 13 through 15 in the maritime domain, namely head-on, crossing, and overtaking, and three of them, para2, overtake3, and cross16, were also used in the experiments in [21]:
para2 
As shown in Figure 3a, there are two fully homogeneous agents that have their individual destinations located in front of the other agent. They go in parallel at first, but need to intersect somewhere in the middle of their routes. The reference speed v i r e f of any agent is set to 25, which is equal to v i m a x . The coordinates of the origins and destinations of all agents are shown in Table 1. Given these settings, the lower bound of the time step for every agent to reach its destination is 21.1.
overtake3 
As shown in Figure 3b, there are three agents in a row moving in the same direction whose destinations are located on the same straight line in the opposite order. The reference speed v_i^{ref} is 5 for agent 1, 12 for agent 2, and 25 for agent 3. This means the latter agents need to overtake the former agents somewhere in the middle of their courses. The coordinates of the origins and destinations of all agents are shown in Table 1. Given these settings, the average lower bound of the time step for each agent to reach its destination is 46.3.
face4 
As shown in Figure 3c, there are four agents placed on the vertices of a square, each of which aims to reach the destination behind another agent on its diagonal. Clearly, their shortest paths intersect in the center of the square. The reference speed v i r e f of any agent is set to 25. The coordinates of the origins and destinations of all agents are shown in Table 1. Given these settings, the lower bound of the time step for every agent to reach its destination is 37.7.
para4 
As shown in Figure 3d, there are four agents that require similar but more complex crossing behavior than para2. The reference speed v i r e f of any agent is set to 25. The coordinates of the origins and destinations of all agents are shown in Table 1. Given these settings, the lower bound of the time step for every agent to reach its destination is 22.4.
cross16 
As shown in Figure 3e, there are 16 agents in four groups of four; the groups cross each other in the middle and head toward the sides opposite their starting points. The reference speed v_i^{ref} of every agent is set to 25. The coordinates of the origins and destinations of all agents are shown in Table 1. Given these settings, the lower bound of the time step for every agent to reach its destination is 23.0.
Note that we did not use real AIS data from the maritime domain. This is because the DSSA+ can generally solve a problem instance consisting of real positional and speed data obtained from AIS very easily [21], so a comparison on such data would not be very informative.
Figure 3. Snapshots of five scenarios: (a) para2, (b) overtake3, (c) face4, (d) para4, and (e) cross16. Black circles with numbers inside are agents, and yellow squares with the same numbers are their respective destinations. The small red lines of agents indicate their headings, which are highlighted in this figure by black arrows in front of them. The small red numbers on the top right of agents indicate their current speeds.
Table 1. Coordinates of the origins and destinations of all agents for each scenario. For each agent, "o[x, y]" and "d[x, y]" indicate the xy-coordinates of its origin and destination, respectively. Note that the upper left corner of the space is [0, 0].
Scenario | Coordinates of Origins and Destinations
para2 | agent 1: o[368, 700] d[432, 200], agent 2: o[432, 700] d[368, 200]
overtake3 | agent 1: o[300, 500] d[500, 300], agent 2: o[200, 600] d[600, 200], agent 3: o[100, 700] d[700, 100]
face4 | agent 1: o[100, 100] d[750, 750], agent 2: o[100, 700] d[750, 50], agent 3: o[700, 100] d[50, 750], agent 4: o[700, 700] d[50, 50]
para4 | agent 1: o[368, 700] d[432, 200], agent 2: o[432, 700] d[368, 200], agent 3: o[304, 700] d[496, 200], agent 4: o[496, 700] d[304, 200]
cross16 | agent 1: o[150, 250] d[700, 250], agent 2: o[150, 350] d[700, 350], agent 3: o[150, 450] d[700, 450], agent 4: o[150, 550] d[700, 550],
agent 5: o[250, 650] d[250, 100], agent 6: o[350, 650] d[350, 100], agent 7: o[450, 650] d[450, 100], agent 8: o[550, 650] d[550, 100],
agent 9: o[650, 250] d[100, 250], agent 10: o[650, 350] d[100, 350], agent 11: o[650, 450] d[100, 450], agent 12: o[650, 550] d[100, 550],
agent 13: o[250, 150] d[250, 700], agent 14: o[350, 150] d[350, 700], agent 15: o[450, 150] d[450, 700], agent 16: o[550, 150] d[550, 700]
To our knowledge, our proposed method is unique, and there is no other algorithm with which to compare it directly. Of course, we are aware of the series of ADMM-based distributed collision avoidance algorithms, but they have never taken the perspective of improving the efficiency of collision avoidance actions by adjusting the two collision avoidance methods (i.e., course change and speed change) separately, and have never provided any tools for this purpose. Therefore, we decided that the only baseline for the performance comparison in our experiments would be the DSSA+, which explicitly has such tools (namely, α and β in Equation (3)).
We compare the performance of the DSSQ with those of three different versions of the original DSSA+, denoted by DSSA+_{0.1,0.9}, DSSA+_{0.5,0.5}, and DSSA+_{0.9,0.1}, in which all of the agents set the constant values (0.1, 0.9), (0.5, 0.5), and (0.9, 0.1) to (α, β), respectively, throughout an episode.
For each scenario, we ran each of these algorithms, measured the time step at which each agent reached its destination, and computed the average and variance over the agents for each run (episode). We ran 100 such episodes.
We should point out here that the best indicator for evaluating the sophistication of collision avoidance using both course and speed changes is the time it takes for each mobile agent to move to avoid a collision, rather than the distance traveled by each agent. For example, if a mobile agent is traveling straight toward its destination and slows down without changing course to avoid colliding with another mobile agent along the way, the distance traveled to the destination will not change, but the total travel time will increase by the amount of deceleration. On the other hand, if the agent changes course without changing its speed, the distance traveled to the destination will increase, and so will the total travel time. In both cases, it is time that is lost, not distance. Namely, which collision avoidance method is better should be measured by the total travel time, not the total travel distance.
The simulation environment is as follows: Intel Core i9-8950HK (2.90 GHz, 32 GB memory), NVIDIA Quadro P2000, Windows 10 Pro, Python version 3.8.10, and PyTorch 1.4.0.
We trained the DSSQ prior to the comparison with the DSSA+. Figure 4 shows the time profiles of the average time steps over the agents for each scenario. We repeated this training five times. Each training run continued until a total of 50,000 successful collision-free episodes was achieved for all scenarios except cross16, for which it continued until a total of 35,000 successful collision-free episodes was achieved. In Figure 4, a solid blue line indicates the average over the five runs, and a light blue band indicates the range between the worst and best values of the five runs. These data were sampled and plotted every 500 episodes to improve the visibility of the graphs. As the episodes progressed, the average time steps decreased at a roughly monotonic pace for all scenarios except cross16. For cross16, on the other hand, we observed an interesting behavior that was somewhat different from the other scenarios: the average time step rose initially in the early stages of training, and then gradually decreased to below the initial value. As a result, the average time steps of the DSSQ largely converged to values of approximately 1.05 times for para2, 1.08 times for overtake3, 1.03 times for face4, 1.13 times for para4, and 1.14 times for cross16, relative to the theoretical lower bounds of the respective scenarios. Note that these theoretical lower bounds are computed manually under the assumption that the agents do not avoid collisions, i.e., they pass through each other as if no collisions occurred. Thus, the optimal average time step for a real collision-free path will always be larger than the theoretical lower bound.
Figure 4. Training results of the DSSQ: time profiles on the average time steps over the agents for (a) para2, (b) overtake3, (c) face4, (d) para4, and (e) cross16. Note that training was conducted five times for each scenario. A solid blue line indicates the average over five runs, and a light blue band indicates the range of the worst and best values of five runs.
Table 2 compares the performance of the DSSQ and three versions of the DSSA+ in terms of the average time step over the agents. As can be seen, we can confirm that the final performance of the DSSQ is almost the same as the highest performance of the DSSA+. In other words, the action sequence of every agent finally obtained by the DSSQ hardly degrades the optimality of the system-wide performance.
Table 2. Average time step over the agents (averaged over 100 episodes).
In para2, face4, and cross16, since the distances between the origins and destinations of all agents are the same, and their reference speeds are also the same, the situations in which all of the agents are placed are essentially identical. We call such a scenario a "symmetric" scenario. In a symmetric scenario, since there is no other factor that determines superiority or inferiority between agents, it is desirable that the number of time steps each agent spends in one episode be as close to the average as possible (i.e., that the variance be as small as possible). In other words, in a symmetric scenario, it is desirable to avoid a situation in which many agents reach their destinations in the minimum time step while only a few specific agents spend excessive time steps to avoid collisions. Table 3 shows the variance in time steps over the agents obtained by each algorithm. The numbers in the table indicate the maximum variance over 100 episodes for each scenario. In cross16, the most complex symmetric scenario, the maximum variance of the DSSQ was significantly smaller, about one-seventh of the maximum variance of the best version of the DSSA+. This indicates that most of the agents in the DSSQ reached their respective destinations in a near-minimum average number of time steps, whereas some agents in DSSA+_{0.5,0.5} took time steps that deviated significantly from that minimum average. In para2 and face4, which are simpler symmetric scenarios, the variances of the DSSQ were sufficiently small, but there was no such clear reduction compared to the best version of the DSSA+. These results suggest that the DSSQ has learned effective collision avoidance behaviors that the DSSA+ could never realize, especially in complex symmetric scenarios.
Table 3. Variance in time steps over the agents (maximum over 100 episodes).
On the other hand, since overtake3 and para4 are not symmetric scenarios, we do not believe that comparing the variance in time steps over the agents is very meaningful there. However, we would like to point out that the variances of the DSSQ in overtake3 and para4 did not reach the extremely large values seen in some versions of the DSSA+.
During the learning process, each agent aims to minimize the loss defined by Equation (7) while, if a collision occurs, it obtains a negative reward (r_col = −10). As a result, a time-efficient collision avoidance behavior, which is indeed our goal, is obtained; conversely, this means that the agent may learn "risky" collision avoidance behavior, such as passing other agents just before a collision would occur. However, we point out that such a risk could be controlled by adjusting the magnitude of the negative reward r_col. In our current experiment with r_col = −10, the collision rate of the DSSQ decreased to almost zero after learning in most scenarios except overtake3, but in overtake3, collisions were still observed relatively frequently, in approximately 10% of runs, even after learning. The magnitude of the negative reward given for collisions during learning appears to clearly affect the time efficiency and collision rate of the resulting paths. We consider a comprehensive evaluation of this effect to be a topic for future work.

7. Conclusions and Future Work

In this paper, we proposed the DSSQ, in which the agents can generate time-efficient collision-free paths while they learn independently whether to detour or change speeds depending on situations. We experimentally confirmed that the average time steps by the DSSQ largely converged to values of approximately 1.05 times for para2, 1.08 times for overtake3, 1.03 times for face4, 1.13 times for para4, and 1.14 times for cross16, relative to their theoretical lower bounds for the scenarios, respectively. This means that the final performance of the DSSQ is almost the same as the highest performance of the DSSA+ in terms of the average time step. Furthermore, we confirmed that the maximum variance by the DSSQ was significantly smaller, about one-seventh of the maximum variance by the best one of the DSSA+ in cross16, which is the most complex symmetric scenario. Based on these results, we think that it is safe to say that the DSSQ enables the agents to improve both their own time efficiency and the system-wide performance simultaneously.
Although the experiment was an abstract simulation, we think that it demonstrated that by repeatedly minimizing the total cost of all of the agents by the DSSA+ and minimizing each agent’s individual loss by DRL, it is possible to create a time-efficient distributed collision avoidance algorithm in terms of both the average and variance. This fact suggests that in distributed optimization, which can essentially be considered multi-objective optimization, it may be possible to design a desirable distributed algorithm bottom up by iteratively adjusting hyper-parameters via learning so that it can reduce the gap between system optimality and individual optimality. We believe this idea has the potential to become a new principle that can be applied to the general design of desirable algorithms for solving distributed optimization problems.
The primary focus of this paper is the development of a general-purpose algorithm for distributed collision avoidance, not its specific application to physical moving objects. However, if we consider applying the DSSA+ and the DSSQ to physical moving objects in the future, we think that at least the following two extensions are necessary. The first is to combine the DSSA+ and the DSSQ with a specific object motion model, such as the one with non-holonomic constraints [47]. This will enable more accurate prediction of the real-world motion of moving objects. The second is to combine them with a hydrodynamics model that predicts the flow of the field to take external factors, such as wind or waves, into account. This will enable more accurate prediction of the real-world motion of moving objects in complex environments. Although these extensions are beyond the scope of this work, we consider that these are important issues for the future.
As shown in this paper, the agents in the DSSQ learn independently, and there is no collaboration among individual agents during the learning process. However, it would be beneficial, as future work, to introduce some collaboration among agents not only for collision avoidance but also for learning, as in some techniques for Multi-Agent Reinforcement Learning [48].

Author Contributions

Conceptualization, K.H., K.G. and T.O.; methodology, K.H. and K.G.; software, K.H., K.G. and J.K.; validation, K.H., K.G. and J.K.; formal analysis, K.H., K.G. and J.K.; investigation, K.H. and J.K.; resources, K.H.; data curation, K.H. and J.K.; writing—original draft preparation, K.H. and K.G.; writing—review and editing, K.H., J.K., T.O. and D.K.; visualization, K.H., K.G. and J.K.; supervision, K.H., T.O. and D.K.; project administration, K.H.; funding acquisition, K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI grant number 23K24903.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

Author Kazuma Gohara is currently employed by the company Japan Airlines Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hennes, D.; Claes, D.; Meeussen, W.; Tuyls, K. Multi-Robot Collision Avoidance with Localization Uncertainty. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2012), Valencia, Spain, 4–8 June 2012; pp. 147–154. [Google Scholar]
  2. Huang, Y.; Chen, L.; Chen, P.; Negenborn, R.R.; van Gelder, P. Ship collision avoidance methods: State-of-the-art. Saf. Sci. 2020, 121, 451–473. [Google Scholar] [CrossRef]
  3. Wang, Q.; Phillips, C. Cooperative collision avoidance for multi-vehicle systems using reinforcement learning. In Proceedings of the 2013 18th International Conference on Methods Models in Automation Robotics (MMAR), Miedzyzdroje, Poland, 26–29 August 2013; pp. 98–102. [Google Scholar]
  4. Chung, S.J.; Paranjape, A.A.; Dames, P.; Shen, S.; Kumar, V. A Survey on Aerial Swarm Robotics. IEEE Trans. Robot. 2018, 34, 837–855. [Google Scholar] [CrossRef]
  5. Alonso-Mora, J.; Breitenmoser, A.; Rufli, M.; Beardsley, P.; Siegwart, R. Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed Autonomous Robotic Systems: The 10th International Symposium; Springer: Berlin/Heidelberg, Germany, 2013; pp. 203–216. [Google Scholar]
  6. Kuderer, M.; Kretzschmar, H.; Sprunk, C.; Burgard, W. Feature-Based Prediction of Trajectories for Socially Compliant Navigation. In Robotics: Science and Systems VIII; The MIT Press: Cambridge, MA, USA, 2013; pp. 193–200. [Google Scholar]
  7. Phillips, M.; Likhachev, M. SIPP: Safe interval path planning for dynamic environments. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA-2011), Shanghai, China, 9–13 May 2011; pp. 5628–5635. [Google Scholar]
  8. van den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Reciprocal n-Body Collision Avoidance. In Robotics Research; Pradalier, C., Siegwart, R., Hirzinger, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 3–19. [Google Scholar]
  9. Chen, Y.F.; Liu, M.; Everett, M.; How, J.P. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA-2017), Singapore, 29 May–3 June 2017; pp. 285–292. [Google Scholar]
  10. Everett, M.; Chen, Y.F.; How, J.P. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-2018), Madrid, Spain, 1–5 October 2018; pp. 3052–3059. [Google Scholar]
  11. Fan, T.; Long, P.; Liu, W.; Pan, J. Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios. Int. J. Robot. Res. 2020, 39, 856–892. [Google Scholar] [CrossRef]
  12. Felner, A.; Stern, R.; Shimony, S.E.; Boyarski, E.; Goldenberg, M.; Sharon, G.; Sturtevant, N.; Wagner, G.; Surynek, P. Search-Based Optimal Solvers for the Multi-Agent Pathfinding Problem: Summary and Challenges. In Proceedings of the Tenth International Symposium on Combinatorial Search (SoCS-2017), Pittsburgh, PA, USA, 16–17 June 2017; pp. 29–37. [Google Scholar]
  13. Tang, S.; Thomas, J.; Kumar, V. Hold or take Optimal Plan (HOOP): A quadratic programming approach to multi-robot trajectory generation. Int. J. Robot. Res. 2018, 37, 1062–1084. [Google Scholar] [CrossRef]
  14. Kim, D.G.; Hirayama, K.; Park, G.K. Collision Avoidance in Multiple-Ship Situations by Distributed Local Search. J. Adv. Comput. Intell. Intell. Informatics 2014, 18, 839–848. [Google Scholar] [CrossRef]
  15. Kim, D.; Hirayama, K.; Okimoto, T. Ship Collision Avoidance by Distributed Tabu Search. TransNav Int. J. Mar. Navig. Saf. Sea Transp. 2015, 9, 23–29. [Google Scholar] [CrossRef]
  16. Kim, D.; Hirayama, K.; Okimoto, T. Distributed Stochastic Search Algorithm for Multi-ship Encounter Situations. J. Navig. 2017, 70, 699–718. [Google Scholar] [CrossRef]
  17. Zheng, H.; Negenborn, R.R.; Lodewijks, G. Fast ADMM for Distributed Model Predictive Control of Cooperative Waterborne AGVs. IEEE Trans. Control Syst. Technol. 2017, 25, 1406–1413. [Google Scholar] [CrossRef]
  18. Chen, L.; Hopman, H.; Negenborn, R.R. Distributed model predictive control for vessel train formations of cooperative multi-vessel systems. Transp. Res. Part C Emerg. Technol. 2018, 92, 101–118. [Google Scholar] [CrossRef]
  19. Chen, L.; Negenborn, R.R.; Hopman, H. Intersection Crossing of Cooperative Multi-vessel Systems. IFAC-PapersOnLine 2018, 51, 379–385. [Google Scholar] [CrossRef]
  20. Ferranti, L.; Negenborn, R.R.; Keviczky, T.; Alonso-Mora, J. Coordination of Multiple Vessels Via Distributed Nonlinear Model Predictive Control. In Proceedings of the 2018 European Control Conference (ECC), Limassol, Cyprus, 12–15 June 2018; pp. 2523–2528. [Google Scholar]
  21. Hirayama, K.; Miyake, K.; Shiota, T.; Okimoto, T. DSSA+: Distributed Collision Avoidance Algorithm in an Environment where Both Course and Speed Changes are Allowed. TransNav Int. J. Mar. Navig. Saf. Sea Transp. 2019, 13, 117–124. [Google Scholar] [CrossRef]
  22. Li, S.; Liu, J.; Negenborn, R.R. Distributed coordination for collision avoidance of multiple ships considering ship maneuverability. Ocean Eng. 2019, 181, 212–226. [Google Scholar] [CrossRef]
  23. Akdağ, M.; Fossen, T.I.; Johansen, T.A. Collaborative Collision Avoidance for Autonomous Ships Using Informed Scenario-Based Model Predictive Control. IFAC-PapersOnLine 2022, 55, 249–256. [Google Scholar] [CrossRef]
  24. Tran, H.A.; Johansen, T.A.; Negenborn, R.R. Parallel distributed collision avoidance with intention consensus based on ADMM. IFAC-PapersOnLine 2024, 58, 302–309. [Google Scholar] [CrossRef]
  25. Zhang, W.; Wang, G.; Xing, Z.; Wittenburg, L. Distributed stochastic search and distributed breakout: Properties, comparison and applications to constraint optimization problems in sensor networks. Artif. Intell. 2005, 161, 55–87. [Google Scholar] [CrossRef]
  26. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar]
  27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  28. CBSMornings. Synchronized Walking Becomes Staple at Japanese University. 2021. Available online: https://www.youtube.com/watch?v=uDgEQGsh7Qs (accessed on 22 August 2025).
  29. Shen, H.; Hashimoto, H.; Matsuda, A.; Taniguchi, Y.; Terada, D.; Guo, C. Automatic collision avoidance of multiple ships based on deep Q-learning. Appl. Ocean Res. 2019, 86, 268–288. [Google Scholar] [CrossRef]
  30. Godoy, J.E.; Karamouzas, I.; Guy, S.J.; Gini, M. Implicit coordination in crowded multi-agent navigation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-2016), Phoenix, AZ, USA, 12–17 February 2016; pp. 2487–2493. [Google Scholar]
  31. Purwin, O.; D’Andrea, R.; Lee, J.W. Theory and implementation of path planning by negotiation for decentralized agents. Robot. Auton. Syst. 2008, 56, 422–436. [Google Scholar] [CrossRef]
  32. Zheng, Y.; Li, S.E.; Li, K.; Borrelli, F.; Hedrick, J.K. Distributed Model Predictive Control for Heterogeneous Vehicle Platoons Under Unidirectional Topologies. IEEE Trans. Control Syst. Technol. 2017, 25, 899–910. [Google Scholar] [CrossRef]
  33. Wang, P.; Deng, H.; Zhang, J.; Wang, L.; Zhang, M.; Li, Y. Model Predictive Control for Connected Vehicle Platoon Under Switching Communication Topology. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7817–7830. [Google Scholar] [CrossRef]
  34. Qiang, Z.; Dai, L.; Chen, B.; Xia, Y. Distributed Model Predictive Control for Heterogeneous Vehicle Platoon With Inter-Vehicular Spacing Constraints. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3339–3351. [Google Scholar] [CrossRef]
  35. Du, G.; Zou, Y.; Zhang, X.; Fan, J.; Sun, W.; Li, Z. Efficient Motion Control for Heterogeneous Autonomous Vehicle Platoon Using Multilayer Predictive Control Framework. IEEE Internet Things J. 2024, 11, 38273–38290. [Google Scholar] [CrossRef]
  36. Hirayama, K.; Yokoo, M. Distributed Partial Constraint Satisfaction Problem. In Proceedings of the Third International Conference on Principles and Practice of Constraint Programming (CP-1997), Linz, Austria, 29 October–1 November 1997; pp. 222–236. [Google Scholar]
  37. Petcu, A.; Faltings, B. A Scalable Method for Multiagent Constraint Optimization. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-2005), Edinburgh, UK, 30 July–5 August 2005; pp. 266–271. [Google Scholar]
  38. Gershman, A.; Meisels, A.; Zivan, R. Asynchronous Forward Bounding for Distributed COPs. J. Artif. Intell. Res. 2009, 34, 61–88. [Google Scholar] [CrossRef]
  39. Pertzovskiy, A.; Zivan, R.; Agmon, N. Collision Avoiding Max-Sum for Mobile Sensor Teams. J. Artif. Intell. Res. 2024, 79, 1281–1311. [Google Scholar] [CrossRef]
  40. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  41. Watkins, C.J.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  42. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-2016), Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  43. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  44. De Asis, K.; Hernandez-Garcia, J.F.; Holland, G.Z.; Sutton, R.S. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-2018), New Orleans, LA, USA, 2–7 February 2018; pp. 2902–2909. [Google Scholar]
  45. Wang, Z.; Schaul, T.; Hessel, M.; Van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-2016), New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  46. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-2018), New Orleans, LA, USA, 2–7 February 2018; pp. 3215–3222. [Google Scholar]
  47. Fan, J.; Zhang, X.; Zheng, K.; Zou, Y.; Zhou, N. Hierarchical path planner combining probabilistic roadmap and deep deterministic policy gradient for unmanned ground vehicles with non-holonomic constraints. J. Frankl. Inst. 2024, 361, 106821. [Google Scholar] [CrossRef]
  48. Wen, M.; Kuba, J.; Lin, R.; Zhang, W.; Wen, Y.; Wang, J.; Yang, Y. Multi-agent reinforcement learning is a sequence modeling problem. Adv. Neural Inf. Process. Syst. 2022, 35, 16509–16521. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
