# Strategies for Scaleable Communication and Coordination in Multi-Agent (UAV) Systems

## Abstract

## 1. Introduction

## 2. Reinforcement Learning

#### 2.1. Overview

#### 2.2. The System State Space

#### 2.3. Rewards

#### 2.4. Actions, Policies, and Value Functions

#### 2.5. Value Iteration vs Policy Iteration

#### 2.6. The Deadly Triad

## 3. Multi-Agent Planning and Control

#### 3.1. Locally Interactive State Spaces

#### 3.2. Soft Locally Interactive Structure

#### 3.3. Performance Metrics

#### 3.4. Hierarchical Reinforcement Learning

#### 3.5. Sequential Decision-Making

## 4. Multi-Agent Communication and Networking

#### 4.1. Routing Protocols in Ad-Hoc Networks

#### 4.2. Clustering Protocols

#### 4.3. Hierarchical Multi-Level Clustering

- (O1) Cluster-Head Reassignment: Sometimes a designated cluster-head is no longer suitable for the role. This occurs when another member is closer to the centroid of the cluster, or when the cluster-head “de-spawns” (i.e., powers down or self-disables). Before the trigger event, the cluster-head transfers the cluster token (and the associated responsibilities) to the member closest to the centroid.
- (O2) Member Transfer: An agent may drift closer to the centroid of another cluster. The “old” cluster-head then initiates a transfer to the new cluster by alerting the higher-level head of the sub-tree. Upon approval, the old token deletes its pointers to the departing agent. The new token adds pointers to the joining agent, which reciprocates with pointers of its own, as depicted in Figure 7.
- (O3) Cluster Split: Level-$k$ clusters with sizes that exceed a certain threshold may split in two. The level-$(k+1)$ cluster-head creates a new level-$k$ token and designates as the new level-$k$ cluster-head the “idle” member (an agent in the sub-tree that is not the head of any cluster) closest to the centroid. The new cluster includes all members of the old cluster closest to the new cluster-head. The token pointers are adjusted accordingly.
- (O4) Cluster Assimilation: Clusters can also shrink when their members leave or de-spawn. If the size of a level-$k$ cluster falls below a threshold, the level-$k$ cluster-head initiates an assimilation request, in which the head of the sub-tree moves the members to the closest level-$k$ clusters. The token of the old level-$k$ cluster is then destroyed.
- (O5) Boss Demotion: As the system approaches the end of its operating lifetime, agents de-spawn faster than they re-spawn. The MLC tree then shrinks until the head of the tree (i.e., the boss) has only one member in its cluster. The boss destroys its token to make this member the new boss and eliminate the superfluous extra level.
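The maintenance operations above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ClusterToken` class, the function names, the choice to store the head among the members, and the size thresholds are all hypothetical, and the split rule assumes the new head is drawn from the cluster being split.

```python
from dataclasses import dataclass, field
from math import dist
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class ClusterToken:
    """Hypothetical cluster token: the head's id plus pointers to every member."""
    head: str
    members: Dict[str, Point] = field(default_factory=dict)  # assumed to include the head

    def centroid(self) -> Point:
        xs, ys = zip(*self.members.values())
        return sum(xs) / len(xs), sum(ys) / len(ys)


def reassign_head(token: ClusterToken) -> str:
    """O1: pass the token to the member closest to the centroid."""
    c = token.centroid()
    # sorted() makes tie-breaking deterministic (lowest id wins).
    token.head = min(sorted(token.members), key=lambda a: dist(token.members[a], c))
    return token.head


def transfer_member(old: ClusterToken, new: ClusterToken, agent: str) -> None:
    """O2: the old token drops its pointer and the new token adds one."""
    new.members[agent] = old.members.pop(agent)


def split_cluster(token: ClusterToken, new_head: str, max_size: int) -> Optional[ClusterToken]:
    """O3: if the cluster is too large, move members nearer to new_head into a new token."""
    if len(token.members) <= max_size:
        return None
    old_pos, new_pos = token.members[token.head], token.members[new_head]
    movers = [a for a, p in token.members.items() if dist(p, new_pos) < dist(p, old_pos)]
    fresh = ClusterToken(head=new_head)
    for agent in movers:
        transfer_member(token, fresh, agent)
    return fresh


def needs_assimilation(token: ClusterToken, min_size: int) -> bool:
    """O4 trigger: the cluster has shrunk below the (assumed) minimum size."""
    return len(token.members) < min_size
```

In the described scheme each operation is approved by the higher-level head of the sub-tree; the sketch omits that message exchange and only shows how the token pointers change.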

#### 4.4. Belief Propagation in Multi-Level Clustering

## 5. A Proposed Control/Communication Architecture for Multi-Agent Systems

#### 5.1. Multi-Agent Coordination and Communication in the Literature

#### 5.2. Developing a Proof-Of-Concept: A Structured Approach

- (M1) The Single-Agent, Stationary Interest-Points Scenario. The agent visits a sequence of stationary interest-points on a two-dimensional grid, using the shortest path. Training is performed with basic RL, under the assumption that the agent has full state information. For a single stationary interest-point, this scenario reduces to the grid-world problem, a well-known example in RL. The grid-world problem becomes much harder with two interest-points. A scenario with multiple interest-points corresponds to the TSP, where the optimal solution requires an exhaustive search over all possible trajectories. The next section will explain why this milestone, even with two interest-points, benefits from advanced strategies like hierarchical RL.
- (M2) The Multi-Agent, Stationary Interest-Points Scenario. A system of multiple agents finds an approximately optimal set of trajectories that visit stationary interest-points on a two-dimensional grid. Training is performed using basic RL, under the assumption that each agent now observes the full system state. This milestone incorporates macro-actions (e.g., “Agents 1, 4, and 5 move to sub-region A”) to keep the distribution of agents and interest-points in equilibrium. It focuses on policies reasonably achievable by training Q-tables, excluding moving interest-points and the large-scale activity of many agents (i.e., dynamic sky-way networks).
- (M3) The Multi-Agent, Mobile Interest-Points Scenario. A system of agents tracks moving interest-points on a two-dimensional grid. Training is performed using deep RL, under the assumption that each agent observes the full system state. The objective is to verify that neural networks can model tracking policies for various interest-point trajectories. To achieve this milestone, it is logical to start with one agent and one stationary interest-point, where the reward function has a simple discernible structure. The next step is to let the interest-point move predictably (e.g., around a circle at constant speed). Next, add two interest-points. Then incrementally increase the complexity of the system, verifying at each stage that the extra complexity is captured in training.
- (M4) The Scaleable Multi-Agent Scenario. The system maintains approximately optimal performance as the number of agents grows large. Training is performed using deep RL, but each agent relies on its local state and a compressed/aggregated version of the global state space. Macro-actions now include unmanned traffic management (UTM) to support large-scale, high-traffic drone transportation. The “dynamic sky-way network” redistributes agents over the grid to keep the distribution of agents and interest-points in equilibrium, and consists of a few high-throughput airways for long distances and many low-throughput airways for finer redistribution over short distances. The shape of the network adapts to the distribution of interest-points in real time. Agents collectively determine the trajectories of airways that traverse their respective sub-regions.
- (M5) The Multi-Agent, Multi-Level Clustering Scenario. Agents rely on a multi-level clustering scheme to disseminate/aggregate their local state information and infer a common view of the global state. Cluster-heads compress state information to ensure that the information throughput scales sub-linearly. Each cluster organizes local activity and cooperates with higher-level clusters to ensure the system behaves in a coordinated manner. The system supports essential distributed tasks like cluster formation, cluster maintenance, and synchronization. The clustering algorithm is fully integrated into the RL training process developed in the previous milestones.
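The TSP remark in milestone M1 can be made concrete with a short Python sketch that searches exhaustively over visiting orders. It assumes 4-connected movement on the grid (so the shortest path between two cells is their Manhattan distance) and an open tour that does not return to the start; the function names are illustrative and do not come from the paper.

```python
from itertools import permutations
from typing import Sequence, Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    """Shortest path length between two cells under 4-connected movement."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shortest_visit_order(start: Cell, points: Sequence[Cell]) -> Tuple[int, Tuple[Cell, ...]]:
    """Try every visiting order; the cost grows factorially with len(points)."""
    best = None
    for order in permutations(points):
        length, here = 0, start
        for p in order:
            length += manhattan(here, p)
            here = p
        if best is None or length < best[0]:
            best = (length, order)
    return best


# Cells taken from the two-interest-point case study: start (0,0), points (2,2) and (3,0).
print(shortest_visit_order((0, 0), [(2, 2), (3, 0)]))  # (6, ((3, 0), (2, 2)))
```

With two interest-points there are only two orders to compare, but the number of orders is n! for n interest-points, which is why exhaustive search stops being practical quickly.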

#### 5.3. A Case-Study of Milestone M1

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Definition |
|---|---|
| AUV | Autonomous underwater vehicle |
| DV | Distance vector |
| F-MDP | Factored Markov decision process |
| F-POMDP | Factored partially observable Markov decision process |
| HIRO | Hierarchical reinforcement learning with off-policy correction |
| HRL | Hierarchical reinforcement learning |
| LS | Link state |
| MANETs | Mobile ad-hoc networks |
| MDP | Markov decision process |
| MLC | Multi-level clustering |
| POMDP | Partially observable Markov decision process |
| RL | Reinforcement learning |
| TSP | Travelling salesman problem |
| UAV | Unmanned aerial vehicle |
| UTM | Unmanned aerial traffic management |


**Figure 1.** A multi-UAV tracking scenario. (**a**) Burning trees and UAVs correspond to interest-points and agents, respectively. (**b**) The system must coordinate to visit as many interest-points as possible in the shortest possible time. (**c**) The system state specifies the locations of agents and interest-points. The reward for a state is determined by the number of agents that coincide with interest-points. (**d**) Interest-points evolve over time, so the UAVs must track their distribution.

**Figure 2.** Fundamentals of reinforcement learning. (**a**) The state-action-reward-state-action feedback loop. The state, reward, and action at time $t$ are given by ${s}_{t}$, ${r}_{t}$, and ${a}_{t}$, respectively, where ${a}_{t}=\pi \left({s}_{t}\right)$ and $\pi(\cdot)$ is the control policy. The agent optimizes its policy by observing and reacting to the environment. (**b**) The actor-critic training paradigm. The agent includes the actor and critic. The actor defines the control policy (the family of policies is parameterized by $\theta$). The critic evaluates the policy ($J(\theta)$ is the value of the policy ${\pi}_{\theta}(\cdot)$) based on feedback from the environment, and guides the actor to a better policy (via the gradient $\nabla_{\theta} J(\theta)$).

**Figure 3.** Soft locally interactive structure in the global state space. (**a**) The normalized distributions of agents and interest-points are approximately equal at the resolution of regions $\{1,2,3,4\},\{5\},\{6\},\{7\}$, so the optimal agent activity in regions $\{1,2,3,4\}$ is independent of regions 5, 6, and 7. (**b**) Interest-points drift into region 6, destabilizing the equilibrium between the normalized distributions. Options or “macro-actions” move agents across regions to restore this equilibrium and “localize” the optimal activity.
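The equilibrium described in Figure 3 can be checked numerically with a small Python sketch. The region labels follow the caption, but the agent and interest-point counts, the tolerance, and the function names are illustrative assumptions rather than values from the paper.

```python
from typing import Dict, List


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn per-region counts into a distribution that sums to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty distribution")
    return {region: n / total for region, n in counts.items()}


def imbalance(agents: Dict[str, int], points: Dict[str, int]) -> Dict[str, float]:
    """Per-region gap: normalized share of agents minus share of interest-points."""
    pa, pp = normalize(agents), normalize(points)
    return {r: pa.get(r, 0.0) - pp.get(r, 0.0) for r in sorted(set(pa) | set(pp))}


def regions_short_of_agents(agents: Dict[str, int], points: Dict[str, int],
                            tol: float = 0.05) -> List[str]:
    """Regions whose agent share trails their interest-point share by more than tol."""
    return [r for r, gap in imbalance(agents, points).items() if gap < -tol]


# Hypothetical counts at the resolution {1,2,3,4}, {5}, {6}, {7}.
agents = {"1-4": 4, "5": 2, "6": 2, "7": 2}
balanced = {"1-4": 8, "5": 4, "6": 4, "7": 4}
drifted = {"1-4": 8, "5": 4, "6": 8, "7": 4}   # interest-points drift into region 6

print(regions_short_of_agents(agents, balanced))  # []
print(regions_short_of_agents(agents, drifted))   # ['6']
```

A macro-action in this picture would move agents toward the regions the last function returns, until the list is empty again.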

**Figure 4.** A self-adjusting options-induced skyway network. (**a**) The grid divides hierarchically into regions with the same number of interest-points at cells of the same level. (**b**) Options-induced trajectories move UAVs to different regions of each cell. Higher-throughput trajectories occur at higher levels and are represented by thicker lines. Base-stations are denoted in blue.

**Figure 5.** The consensus problem in cluster formation. (**a**) Some agents that view themselves as members of blue clusters appear to other agents as heads of purple clusters. (**b**) Agents attempting to form purple clusters become orphans (circled in orange).

**Figure 6.** Hierarchical multi-level clustering. (**a**) A 3-level hierarchical tree, where blue = level 1, green = level 2, and orange = level 3. Every level 2 and level 3 cluster-head belongs to a level 1 cluster. (**b**) Ideally, higher-level clusters are more stable and tolerate drift over longer trajectories.

**Figure 7.** Member transfer and cluster-head reassignment. (**a**) A cluster-member drifts away from its cluster-head. (**b**) The cluster elects a new cluster-head closer to the centroid. (**c**) The new cluster-head initiates a transfer, which allows the remote member to join the more suitable cluster.

**Figure 9.** The single-agent, single stationary interest-point scenario. (**a**) The initial position of the agent is in cell (0,0). The interest-point is positioned in cell (2,2). (**b**) The number of training steps per episode converges to the optimum of 4. (**c**) The number of visits to each cell over the entire training process. Cells on the shortest path to the interest-point receive the most visits. (**d**) The estimated maximum q-value of each cell. Cells closer to the interest-point have more value.
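The experiment summarized in Figure 9 can be approximated with tabular Q-learning, sketched below in Python. Only the start cell (0,0), the interest-point cell (2,2), and the optimum of 4 steps come from the caption. The 4x4 grid size, the reward values, the optimistic initial Q-values, and the hyperparameters are assumptions chosen for illustration, and the code is not the authors' training setup.

```python
import random

SIZE = 4                        # assumed grid size; the caption does not state it
START, GOAL = (0, 0), (2, 2)    # cells named in the Figure 9 caption
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def step(state, action):
    """Move one cell (staying put at a wall); return (next_state, reward, done)."""
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (x, y)
    if nxt == GOAL:
        return nxt, 10.0, True  # assumed reward for reaching the interest-point
    return nxt, -1.0, False     # assumed per-step penalty


def greedy(q, state):
    """Index of the action with the largest Q-value (first one wins ties)."""
    return max(range(len(ACTIONS)), key=lambda a: q[state][a])


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Optimistic initial values (an assumption) encourage early exploration.
    q = {(x, y): [10.0] * len(ACTIONS) for x in range(SIZE) for y in range(SIZE)}
    for _ in range(episodes):
        state, done, steps = START, False, 0
        while not done and steps < 200:
            explore = rng.random() < epsilon
            a = rng.randrange(len(ACTIONS)) if explore else greedy(q, state)
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])  # Q-learning update
            state, steps = nxt, steps + 1
    return q


def greedy_path_length(q, limit=50):
    """Steps taken from START to GOAL when always following the greedy action."""
    state, steps, done = START, 0, False
    while not done and steps < limit:
        state, _, done = step(state, ACTIONS[greedy(q, state)])
        steps += 1
    return steps


q_table = train()
print(greedy_path_length(q_table))
```

Note that the Q-table here is indexed by the agent's cell alone, which is sufficient for a single interest-point because the cell captures everything relevant to the remaining task.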

**Figure 10.** The single-agent, two stationary interest-points scenario. (**a**) The initial position of the agent is in cell (0,0). The interest-points are positioned in cells (2,2) and (3,0), respectively. (**b**) The number of training steps per episode does not appear to converge. (**c**) The number of visits to each cell over the entire training process. Both interest-points receive many visits. (**d**) The estimated maximum q-value of each cell. The maximum q-value of cell (3,0) is detrimentally small, even though the cell contains an interest-point and receives many visits.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ponniah, J.; Dantsker, O.D. Strategies for Scaleable Communication and Coordination in Multi-Agent (UAV) Systems. *Aerospace* **2022**, *9*, 488.
https://doi.org/10.3390/aerospace9090488
