DETEAMSK: A Model-Based Reinforcement Learning Approach to Intelligent Top-Level Planning and Decisions for Multi-Drone Ad Hoc Teamwork by Decoupling the Identification of Teammate and Task
Abstract
1. Introduction
- (i) Observation of the environment: This perspective examines the performance of ad hoc teams under complex environmental conditions, encompassing both fully observable and partially observable scenarios [9,14]. In fully observable scenarios, agents have complete knowledge of the environment and reward structure; in partially observable scenarios, they must infer these factors from limited information [18,19,20,21].
- (ii) External situation changes: This perspective examines how ad hoc teams adapt to various external challenges to enhance coordination stability and robustness. Here, the team is treated as a unit interacting with its environment, covering scenarios such as potential attackers [22], non-stationary teammates [23], dynamic environments [24], and few-shot teamwork [25].
- (iii) Internal coordination types: This perspective examines the self-coordination styles of ad hoc teams, categorized as closed-type or open-type [24,26]. In closed-type teams, membership remains constant, and agents aim to observe and learn the known behaviors of their teammates in order to predict future actions and improve coordination [12,19,27,28]. In open-type teams, the team composition varies over time, requiring more complex analysis for coordination [29,30,31,32]. This perspective primarily studies how diverse ad hoc teams achieve efficient coordination.
2. Related Works
3. Background
4. The DETEAMSK Algorithm Method
Algorithm 1: DETEAMSK pseudo-code description
Basic Algorithm Models: Teammate Identification Model, Environment Identification Model E, Action Planning and Learning Mechanism
1. First runtime: t = 0
2. Set the teammate module
3. Set the corresponding belief of the current team to build a new one
4. Repeat steps 1–3 above
5. Observe the current state
6. …
7. …
8. Select and execute …, and continue to observe …
9. Update … to …
10. Update … to …
11. Update … to …
12. …
13. Run until … reaches the end
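Because the boxed pseudo-code above loses its mathematical symbols in this layout, the following minimal Python sketch illustrates how such a decoupled loop, with separate teammate and environment identification models feeding a planner, can be organized. The class names, method names, and environment interface are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the DETEAMSK top-level loop (Algorithm 1).
# All class/method names are illustrative assumptions, not the authors' code.

class AdHocAgent:
    def __init__(self, teammate_model, env_model, planner):
        self.teammate_model = teammate_model   # Teammate Identification Model
        self.env_model = env_model             # Environment Identification Model E
        self.planner = planner                 # Action Planning and Learning Mechanism
        self.belief = None

    def reset(self, initial_team_info=None):
        # Steps 1-3: initialise time, the teammate module and the team belief.
        self.belief = self.teammate_model.initial_belief(initial_team_info)

    def act(self, state):
        # Steps 5-8: observe the current state and plan with both models.
        return self.planner.plan(state, self.belief, self.env_model)

    def learn(self, state, joint_action, reward, next_state):
        # Steps 9-11: update teammate model, environment model and belief.
        self.teammate_model.update(state, joint_action)
        self.env_model.update(state, joint_action, reward, next_state)
        self.belief = self.teammate_model.revise_belief(self.belief, state, joint_action)


def run_episode(env, agent, max_steps=200):
    state = env.reset()
    agent.reset()
    for t in range(max_steps):                 # Step 13: run until the episode ends.
        action = agent.act(state)
        next_state, joint_action, reward, done = env.step(action)
        agent.learn(state, joint_action, reward, next_state)
        state = next_state
        if done:
            break
```

The point mirrored from Algorithm 1 is that the teammate model, the environment model, and the team belief are each updated independently after every interaction step.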
4.1. Teammate Behavior Identification
4.2. Task Environment Identification
4.3. Action Planning and Learning
4.3.1. Multi-Agent Collaborative Dynamic Programming and Learning Framework
4.3.2. Multi-Agent Collaborative Dynamic Programming MATM-UCT Search
Algorithm 2: MATM-UCT pseudo-code description
1. Function: generate actions with dynamic programming
2. For … at the current timestep t, repeat:
3. Predict the joint action of the teammates
4. Predict the environment condition
5. Compute and normalize the reward
6. Select, expand, rollout, backup
7. Exploration and exploitation policy (…)
8. Return …
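As a rough, non-authoritative illustration of one MATM-UCT planning iteration, the sketch below folds teammate-action prediction and environment simulation into a standard select/expand/rollout/backup cycle; the Node structure and the teammate_model.predict / env_model.simulate interfaces are assumptions made for this example.

```python
import math
import random

# Illustrative sketch of one MATM-UCT planning iteration (not the authors' code).

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}      # ad hoc agent action -> child Node
        self.visits = 0
        self.value = 0.0        # cumulative simulated return

def uct_select(node, c=0.95):
    # UCB1-style trade-off between exploitation (mean value) and exploration.
    return max(
        node.children.items(),
        key=lambda kv: kv[1].value / (kv[1].visits + 1e-8)
        + c * math.sqrt(math.log(node.visits + 1) / (kv[1].visits + 1e-8)),
    )

def matm_uct_iteration(root, actions, teammate_model, env_model,
                       max_depth=15, gamma=0.95):
    node, path, ret, depth = root, [root], 0.0, 0
    # Selection: descend through fully expanded nodes.
    while node.children and len(node.children) == len(actions):
        _, node = uct_select(node)
        path.append(node)
        depth += 1
    # Expansion: try one untried action of the ad hoc agent.
    untried = [a for a in actions if a not in node.children]
    if untried:
        a = random.choice(untried)
        mates = teammate_model.predict(node.state)                     # step 3
        next_state, reward = env_model.simulate(node.state, a, mates)  # steps 4-5
        child = Node(next_state)
        node.children[a] = child
        ret += (gamma ** depth) * reward
        node = child
        path.append(child)
        depth += 1
    # Rollout: random playout up to the maximum rollout depth (step 6).
    state = node.state
    for d in range(depth, max_depth):
        a = random.choice(actions)
        mates = teammate_model.predict(state)
        state, r = env_model.simulate(state, a, mates)
        ret += (gamma ** d) * r
    # Backup: propagate the return along the visited path.
    for n in path:
        n.visits += 1
        n.value += ret
```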
4.3.3. Reinforcement Learning Exploration and Exploitation
5. Experiments
5.1. Settings
5.2. Methodology
5.2.1. Baseline Algorithm Methods
5.2.2. Evaluation Procedures
5.2.3. Experiment Items
5.3. Results
5.3.1. Standard Ad Hoc Task Evaluation
5.3.2. Method Mechanism Optimization Evaluation
5.3.3. Performance Convergence Evaluation
6. Discussions
- (i) Environment scalability. Whether in cooperative or competitive tasks, DETEAMSK performs strongly in simple task domains, such as the CBD domain, yielding more stable learning curves, higher accumulated average rewards, and lower variances. Its performance in more complex domains is less favorable and could benefit from further optimization: in the SAD and MFD task domains, DETEAMSK’s learning processes exhibit greater fluctuations and lower evaluated rewards with increased variances. This may be attributed to the difficulty of learning in high-dimensional, dynamic environments, and further research is warranted to explore the impact of such environmental factors.
- (ii) Interaction complexity. The ad hoc agent exhibits an enhanced and more consistent performance when collaborating with teammates of the same category during both the learning and interaction processes. While the agent generally performs effectively after engaging with a simple-class team, some instability may still be observed. To further evaluate the generalizability of DETEAMSK, it is crucial to increase the diversity of interactions.
- (iii) Internal method architecture. Several components within the architecture of DETEAMSK can be optimized to enhance its overall performance, including the optimizer, the action exploration and exploitation policy, and the hyperparameters. Our experiments suggest that RMSprop is effective for DETEAMSK across various domain tasks when used as the optimizer of the feedforward neural networks; for simpler environments, SGD is appropriate, while AdamW suits small to medium-sized task domains (see the optimizer sketch after this list). Investigating more advanced methods is crucial to further improve DETEAMSK.
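As a concrete illustration of the optimizer comparison mentioned in item (iii), the PyTorch sketch below trains a small feedforward identification network with each candidate optimizer listed in Appendix A.1 (Adam, SGD, RMSprop, Adagrad, AdamW). The network size, the placeholder data, and the training loop are illustrative assumptions rather than the experimental code.

```python
import torch
import torch.nn as nn

# Illustrative comparison of the optimizers discussed above; the two-layer ReLU
# network loosely mirrors the identification-model settings in Appendix A.1,
# and the data are random placeholders.

def make_model(in_dim=32, hidden=48, out_dim=8):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

optimizers = {
    "Adam":    lambda p: torch.optim.Adam(p, lr=1e-3),
    "SGD":     lambda p: torch.optim.SGD(p, lr=1e-3),
    "RMSprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
    "Adagrad": lambda p: torch.optim.Adagrad(p, lr=1e-3),
    "AdamW":   lambda p: torch.optim.AdamW(p, lr=1e-3),
}

loss_fn = nn.CrossEntropyLoss()          # e.g. teammate-action classification
x = torch.randn(64, 32)                  # placeholder observation batch
y = torch.randint(0, 8, (64,))           # placeholder teammate actions

for name, build in optimizers.items():
    model = make_model()
    opt = build(model.parameters())
    for _ in range(100):                 # short illustrative training loop
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"{name}: final loss {loss.item():.4f}")
```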
7. Conclusions and Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Hyperparameters and Architectures Used in Experiments
Task Domain | Total Iterations | Maximum Rollout Depth | UCB Exploration | Discount Factor |
---|---|---|---|---|
CBD | 150 | 15 | 0.95 | |
SAD | 150 | 15 | 0.95 | |
MFD | 150 | 15 | 0.95 | |
PAE | 300 | 5 | 0.95 |
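For reference, the "UCB Exploration" setting above refers to the exploration constant c of the UCB1 selection rule commonly used by UCT-style planners, while the discount factor weights simulated returns; a standard form of the rule (not necessarily the exact variant implemented here) is

$$ a^{*} = \arg\max_{a}\left[\bar{Q}(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}}\right], $$

where \(\bar{Q}(s,a)\) is the mean simulated return of action a at node s, N(s) is the visit count of the node, and N(s,a) is the number of simulations that selected a [38,39].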
Model | Optimizer | Learning Rate | Hidden Layers | Hidden Units per Layer | Activation Function | Exploration and Exploitation Policy |
---|---|---|---|---|---|---|
Transition | Adam/SGD/RMSprop/Adagrad/AdamW | 0.001 | 2 | 512 | ReLU | Greedy Policy |
Rewards | Adam/SGD/RMSprop/Adagrad/AdamW | 0.01 | 1 | 64 | ReLU | Greedy Policy |
Teammate Identification | Adam/SGD/RMSprop/Adagrad/AdamW | 0.001 | 2 | 48 | ReLU | Greedy Policy |
Model | Optimizer | Learning Rate | Hidden Layers | Hidden Units per Layer | Activation Function | Exploration and Exploitation Policy |
---|---|---|---|---|---|---|
Transition | Adam | 0.001 | 2 | 256 | ReLU | Linear Annealing Policy |
Rewards | Adam | 0.001 | 2 | 256 | ReLU | Linear Annealing Policy |
Teammate Identification | Adam | 0.001 | 2 | 48 | ReLU | Linear Annealing Policy |
Task Domain | Adam Learning Rate | Hidden Layers | Hidden Units per Layer | Activation Function | Initial Exploration Rate | Final Exploration Rate | Initial Random Steps | Final Exploration Step | Target Update Frequency (Steps) |
---|---|---|---|---|---|---|---|---|---|
CBD | 0.001 | 2 | 512 | ReLU | 1.00 | 0.05 | 15000 | 250000 | 4 |
SAD | 0.001 | 2 | 512 | ReLU | 1.00 | 0.05 | 15000 | 250000 | 4 |
MFD | 0.001 | 2 | 512 | ReLU | 1.00 | 0.05 | 15000 | 250000 | 4 |
PAE | 0.001 | 2 | 64 | ReLU | 0.50 | 0.05 | 0 | 5000 | 1 |
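The exploration settings in the table above follow the usual DQN-style linearly annealed epsilon-greedy scheme: the agent acts randomly for the initial random steps, the exploration rate then decays linearly from its initial to its final value by the final exploration step, and the target network is synchronised at the stated frequency. A minimal sketch of such a schedule is given below; the function names and Q-value interface are illustrative assumptions.

```python
import random

# Minimal sketch of a linearly annealed epsilon-greedy exploration schedule,
# using the CBD/SAD/MFD settings above as the example; the function names and
# Q-value interface are illustrative assumptions, not the experimental code.

def epsilon_at(step, eps_initial=1.00, eps_final=0.05,
               initial_random_steps=15000, final_exploration_step=250000):
    """Return the exploration rate for a given environment step."""
    if step < initial_random_steps:
        return 1.0                      # purely random warm-up phase
    span = final_exploration_step - initial_random_steps
    frac = min(1.0, (step - initial_random_steps) / span)
    return eps_initial + frac * (eps_final - eps_initial)

def select_action(q_values, step):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon_at(step):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: epsilon decays from 1.00 towards 0.05 between steps 15,000 and 250,000.
print(epsilon_at(0), epsilon_at(100_000), epsilon_at(300_000))
```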
Appendix A.2. The Average Accumulated Reward Values of the Ad Hoc Teamwork Interaction of DETEAMSK and Other Baseline Algorithms
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | ∖ |
Random Policy | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) |
UCT-Model | 97.750(±0.750) | 98.000(±0.000) | 97.750(±1.750) | 97.875(±0.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 97.875(±0.875) | 98.000(±0.000) | 97.875(±0.875) | 97.750(±1.750) |
DETEAMSK-Sgd | 96.750(±8.750) | 97.625(±1.625) | 97.125(±2.125) | 97.500(±0.500) |
DETEAMSK-Rmsprop | 98.000(±0.000) | 97.875(±0.875) | 97.750(±1.750) | 98.000(±0.000) |
DETEAMSK-Adagrad | 84.875(±85.875) | 96.375(±5.375) | 72.375(±73.375) | 71.875(±72.875) |
DETEAMSK-Adamw | 98.000(±0.000) | 97.875(±0.875) | 97.750(±1.750) | 98.000(±0.000) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 97.750(±0.750) | 97.750(±0.750) | 97.625(±0.625) | 97.750(±0.750) |
DE Learns Environment | 98.000(±0.000) | 97.625(±1.625) | 97.750(±1.750) | 98.000(±0.000) |
PLASTIC-Model | 97.875(±0.875) | 98.000(±0.000) | 97.750(±0.750) | 97.750(±1.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 98.000(±0.000) | 97.875(±0.875) | 98.000(±0.000) | 98.000(±0.000) |
DETEAMSK-Sgd | 97.750(±0.750) | 97.875(±0.875) | 97.625(±1.625) | 97.875(±0.875) |
DETEAMSK-Rmsprop | 97.750(±0.750) | 97.625(±0.625) | 97.625(±1.625) | 97.875(±0.875) |
DETEAMSK-Adagrad | 96.875(±1.875) | 71.250(±72.250) | 96.875(±1.125) | 72.375(±73.375) |
DETEAMSK-Adamw | 97.750(±0.750) | 98.000(±0.000) | 97.750(±1.750) | 97.625(±1.625) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 97.750(±0.750) | 97.625(±1.625) | 97.625(±1.625) |
DE Learns Environment | 97.750(±0.750) | 97.875(±0.875) | 97.750(±0.750) | 98.000(±0.000) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.875) | 98.000(±0.000) | 97.625(±1.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 97.875(±0.875) | 97.375(±2.375) | 97.750(±0.750) | 98.000(±0.000) |
DETEAMSK-Sgd | 97.125(±1.125) | 97.375(±1.375) | 97.375(±1.375) | 97.500(±1.500) |
DETEAMSK-Rmsprop | 97.875(±0.875) | 97.875(±0.875) | 97.750(±1.750) | 98.000(±0.000) |
DETEAMSK-Adagrad | 96.500(±1.500) | 85.000(±86.000) | 96.875(±1.125) | 96.625(±1.375) |
DETEAMSK-Adamw | 97.625(±0.625) | 97.875(±0.875) | 97.750(±1.750) | 97.875(±0.875) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 97.875(±0.875) | 97.750(±0.750) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Environment | 97.875(±0.875) | 97.750(±0.750) | 97.500(±1.500) | 97.750(±1.750) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.875) | 97.875(±0.875) | 98.000(±0.000) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 98.000(±0.000) | 97.875(±0.000) | 97.875(±0.875) | 97.625(±1.625) |
DETEAMSK-Sgd | 97.000(±5.000) | 97.875(±5.000) | 97.750(±0.750) | 97.625(±0.625) |
DETEAMSK-Rmsprop | 98.000(±0.000) | 97.750(±0.000) | 97.750(±1.750) | 97.625(±0.625) |
DETEAMSK-Adagrad | 72.625(±73.625) | 80.875(±73.625) | 84.375(±85.375) | 83.500(±84.500) |
DETEAMSK-Adamw | 97.750(±1.750) | 97.875(±1.750) | 98.000(±0.000) | 97.875(±0.875) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 97.875(±0.000) | 97.375(±2.375) | 97.625(±0.625) |
DE Learns Environment | 97.875(±0.875) | 97.750(±0.875) | 98.000(±0.000) | 97.625(±1.625) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 90.000(±2.000) | 83.000(±0.000) | 89.875(±1.875) | ∖ |
Random Policy | 78.250(±39.250) | 54.875(±49.875) | 67.250(±48.250) | 53.750(±51.750) |
UCT-Model | 89.125(±1.875) | 84.500(±5.500) | 89.375(±2.375) | 88.750(±2.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 78.750(±79.750) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DETEAMSK-Sgd | 22.750(±71.250) | 32.750(±56.250) | 33.875(±58.125) | 22.000(±70.000) |
DETEAMSK-Rmsprop | 92.000(±7.000) | 76.250(±77.250) | 75.000(±76.000) | 83.625(±27.625) |
DETEAMSK-Adagrad | 45.875(±46.875) | −1.000(±0.000) | 29.750(±63.250) | 42.000(±44.000) |
DETEAMSK-Adamw | 88.250(±17.250) | 52.125(±53.125) | 85.125(±26.125) | 33.250(±58.750) |
PLASTIC-Policy | 81.000(±82.000) | 66.000(±67.000) | 92.250(±3.250) | 56.000(±57.000) |
DE Learns Team | 92.375(±6.375) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DE Learns Environment | 79.000(±80.000) | 73.125(±74.125) | 88.750(±8.750) | 80.875(±81.875) |
PLASTIC-Model | 91.750(±3.750) | 59.125(±60.125) | 91.125(±5.125) | 78.750(±79.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 78.750(±79.750) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DETEAMSK-Sgd | 22.750(±71.250) | 32.750(±56.250) | 33.875(±58.125) | 22.000(±70.000) |
DETEAMSK-Rmsprop | 92.000(±7.000) | 76.250(±77.250) | 75.000(±76.000) | 83.625(±27.625) |
DETEAMSK-Adagrad | 45.875(±46.875) | −1.000(±0.000) | 29.750(±63.250) | 42.000(±44.000) |
DETEAMSK-Adamw | 88.250(±17.250) | 52.125(±53.125) | 85.125(±26.125) | 33.250(±58.750) |
PLASTIC-Policy | 81.000(±82.000) | 66.000(±67.000) | 92.250(±3.250) | 56.000(±57.000) |
DE Learns Team | 92.375(±6.375) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DE Learns Environment | 79.000(±80.000) | 73.125(±74.125) | 88.750(±8.750) | 80.875(±81.875) |
PLASTIC-Model | 91.750(±3.750) | 59.125(±60.125) | 91.125(±5.125) | 78.750(±79.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 64.625(±65.625) | 62.125(±63.125) | 88.750(±19.750) | 72.375(±73.375) |
DETEAMSK-Sgd | 33.875(±60.125) | 29.750(±59.250) | 29.750(±59.250) | 61.250(±62.250) |
DETEAMSK-Rmsprop | 81.250(±76.250) | 75.250(±76.250) | 75.250(±76.250) | 66.000(±67.000) |
DETEAMSK-Adagrad | 26.125(±63.875) | −1.000(±0.000) | −1.000(±0.000) | 28.875(±63.125) |
DETEAMSK-Adamw | 80.735(±81.735) | 63.500(±64.500) | 65.125(±66.125) | 33.250(±58.750) |
PLASTIC-Policy | 63.000(±64.000) | 65.000(±66.000) | 86.750(±8.750) | 79.125(±80.125) |
DE Learns Team | 92.750(±2.750) | 62.125(±63.125) | 88.750(±19.750) | 72.375(±73.375) |
DE Learns Environment | 90.500(±4.500) | 76.750(±77.750) | 92.250(±3.250) | 77.750(±78.750) |
PLASTIC-Model | 91.375(±4.375) | 75.250(±76.250) | 91.750(±8.750) | 78.875(±79.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 58.625(±59.625) | 76.250(±77.250) | 81.250(±52.250) | 67.750(±68.750) |
DETEAMSK-Sgd | 10.875(±83.125) | 43.250(±83.125) | 32.625(±59.375) | 44.500(±47.500) |
DETEAMSK-Rmsprop | 80.375(±81.375) | 75.250(±8.375) | 88.500(±12.500) | 77.125(±78.125) |
DETEAMSK-Adagrad | 5.500(±45.500) | 21.500(±45.500) | 29.875(±49.125) | 15.500(±76.500) |
DETEAMSK-Adamw | 88.000(±9.000) | 42.875(±9.000) | 68.250(±69.250) | 56.250(±57.250) |
PLASTIC-Policy | 88.500(±8.500) | 87.750(±8.750) | 78.875(±79.875) | 78.375(±79.375) |
DE Learns Team | 87.750(±4.750) | 76.250(±77.250) | 81.250(±52.250) | 67.750(±68.750) |
DE Learns Environment | 87.625(±17.625) | 87.750(±5.750) | 89.375(±4.625) | 78.250(±79.250) |
PLASTIC-Model | 91.750(±4.750) | 74.500(±75.500) | 89.000(±26.000) | 77.250(±78.250) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 90.000(±2.000) | 83.000(±0.000) | 89.875(±1.875) | ∖ |
Random Policy | 78.250(±39.250) | 54.875(±49.875) | 67.250(±48.250) | 53.750(±51.750) |
UCT-Model | 89.125(±1.875) | 84.500(±5.500) | 89.375(±2.375) | 88.750(±2.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 90.375(±2.375) | 87.125(±4.125) | 90.125(±2.125) | 88.500(±3.500) |
DETEAMSK-Sgd | 83.375(±29.375) | 72.125(±73.125) | 71.125(±72.125) | 64.250(±41.250) |
DETEAMSK-Rmsprop | 89.625(±1.625) | 69.875(±66.875) | 90.625(±0.625) | 86.250(±18.250) |
DETEAMSK-Adagrad | 48.250(±49.250) | 52.750(±53.750) | 72.750(±73.750) | 21.250(±66.750) |
DETEAMSK-Adamw | 88.250(±1.750) | 68.250(±69.250) | 89.750(±1.750) | 88.625(±2.625) |
PLASTIC-Policy | 44.375(±46.625) | 31.250(±57.750) | −1.000(±0.000) | 55.000(±56.000) |
DE Learns Team | 89.875(±1.875) | 78.875(±14.875) | 89.250(±1.750) | 88.250(±2.750) |
DE Learns Environment | 89.625(±1.625) | 84.750(±5.250) | 89.375(±2.375) | 89.000(±2.000) |
PLASTIC-Model | 90.375(±2.375) | 72.750(±73.750) | 89.000(±2.000) | 88.125(±2.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 88.125(±7.125) | 84.750(±5.250) | 79.000(±80.000) | 86.750(±3.250) |
DETEAMSK-Sgd | 37.375(±51.625) | 74.875(±52.875) | 53.625(±54.625) | 49.875(±50.875) |
DETEAMSK-Rmsprop | 88.500(±14.500) | 71.625(±72.625) | 88.000(±5.000) | 87.250(±4.250) |
DETEAMSK-Adagrad | 48.875(±49.875) | 58.250(±59.250) | 53.625(±54.625) | 46.250(±47.250) |
DETEAMSK-Adamw | 86.625(±21.625) | 76.625(±30.625) | 89.125(±3.125) | 80.125(±40.250) |
PLASTIC-Policy | 78.250(±79.250) | 32.000(±57.000) | 44.875(±46.125) | 67.125(±68.125) |
DE Learns Team | 88.875(±2.125) | 83.750(±15.750) | 89.125(±9.125) | 88.750(±2.250) |
DE Learns Environment | 89.000(±2.000) | 85.625(±4.375) | 90.125(±2.125) | 87.875(±4.875) |
PLASTIC-Model | 90.125(±2.125) | 81.875(±21.875) | 89.125(±1.875) | 87.875(±6.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 90.000(±2.000) | 83.625(±6.625) | 89.750(±1.750) | 88.750(±5.750) |
DETEAMSK-Sgd | 66.750(±67.750) | 60.125(±61.125) | 60.000(±61.000) | 68.500(±69.500) |
DETEAMSK-Rmsprop | 89.250(±1.750) | 79.125(±24.125) | 88.375(±2.375) | 89.000(±3.000) |
DETEAMSK-Adagrad | 74.875(±75.875) | 40.375(±42.625) | 42.000(±48.000) | 56.000(±57.000) |
DETEAMSK-Adamw | 89.250(±1.250) | 82.500(±8.500) | 90.125(±2.125) | 84.625(±31.625) |
PLASTIC-Policy | 89.625(±2.625) | 53.375(±54.375) | 55.875(±56.875) | 76.625(±77.625) |
DE Learns Team | 89.125(±1.875) | 81.000(±12.000) | 89.750(±1.750) | 89.125(±1.875) |
DE Learns Environment | 89.500(±1.500) | 83.875(±6.125) | 88.875(±9.875) | 90.000(±2.000) |
PLASTIC-Model | 88.875(±2.125) | 84.750(±5.250) | 89.500(±1.500) | 87.375(±6.375) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Adam | 88.375(±5.375) | 83.500(±6.500) | 88.250(±12.250) | 89.125(±3.125) |
DETEAMSK-Sgd | 54.750(±55.750) | 58.750(±55.750) | 26.125(±64.875) | 78.625(±23.625) |
DETEAMSK-Rmsprop | 88.375(±2.625) | 67.500(±2.625) | 88.125(±7.125) | 89.000(±5.000) |
DETEAMSK-Adagrad | 45.500(±46.500) | 44.125(±46.500) | 55.750(±56.750) | 63.000(±64.000) |
DETEAMSK-Adamw | 89.625(±1.625) | 83.125(±1.625) | 89.250(±1.750) | 88.750(±2.250) |
PLASTIC-Policy | 78.750(±79.750) | 74.500(±75.500) | 78.000(±79.000) | 76.750(±77.750) |
DE Learns Team | 88.625(±2.375) | 82.500(±10.500) | 89.375(±1.625) | 89.500(±1.500) |
DE Learns Environment | 89.000(±2.000) | 84.750(±5.250) | 89.250(±2.250) | 88.625(±2.375) |
PLASTIC-Model | 90.000(±2.000) | 82.125(±7.875) | 89.000(±4.000) | 83.625(±36.625) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 91.125(±6.125) | 92.000(±6.000) | 95.500(±3.500) | ∖ |
Random Policy | 76.750(±23.250) | 72.000(±34.000) | 20.875(±170.875) | 31.000(±181.000) |
UCT-Model | 91.250(±14.250) | 94.375(±5.375) | 92.625(±6.625) | ∖ |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 90.750(±7.750) | 91.125(±8.875) | 80.500(±23.500) | 88.875(±24.875) |
PLASTIC-Policy | 91.375(±7.625) | 76.375(±48.375) | 84.875(±19.875) | 82.875(±23.875) |
DE Learns Team | 91.250(±17.250) | 90.625(±15.625) | 93.250(±9.250) | 93.500(±9.500) |
DE Learns Environment | 88.375(±17.375) | 85.875(±24.875) | 83.875(±17.875) | 78.250(±40.250) |
PLASTIC-Model | 96.125(±2.125) | 87.750(±21.750) | 81.875(±23.875) | 87.000(±13.000) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 91.625(±8.625) | 91.750(±91.750) | 89.250(±30.250) | 78.875(±26.875) |
PLASTIC-Policy | 90.125(±17.125) | 90.125(±18.125) | 82.125(±13.125) | 82.000(±17.000) |
DE Learns Team | 91.000(±7.000) | 93.500(±7.500) | 94.125(±9.125) | 93.125(±13.125) |
DE Learns Environment | 85.875(±18.875) | 90.250(±9.250) | 92.125(±18.125) | 85.625(±19.625) |
PLASTIC-Model | 91.000(±2.125) | 96.000(±9.000) | 96.000(±5.000) | 90.000(±25.000) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 90.875(±9.875) | 89.750(±13.750) | 93.000(±9.000) | 86.375(±29.375) |
PLASTIC-Policy | 91.875(±7.875) | 88.000(±16.000) | 87.875(±34.875) | 71.875(±58.875) |
DE Learns Team | 93.750(±6.750) | 94.000(±5.000) | 95.250(±3.750) | 90.375(±21.375) |
DE Learns Environment | 91.125(±17.125) | 86.625(±21.625) | 89.375(±8.625) | 73.125(±25.125) |
PLASTIC-Model | 94.250(±8.250) | 96.375(±3.375) | 94.750(±6.750) | 90.500(±10.500) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 90.250(±11.250) | 91.375(±22.375) | 89.125(±27.125) | 87.125(±21.125) |
PLASTIC-Policy | 72.125(±24.875) | 89.000(±12.000) | 70.875(±70.875) | 85.750(±12.750) |
DE Learns Team | 89.625(±1.625) | 69.875(±66.875) | 90.625(±0.625) | 86.250(±18.250) |
DE Learns Environment | ∖ | ∖ | ∖ | ∖ |
PLASTIC-Model | ∖ | ∖ | ∖ | ∖ |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 82.750(±17.750) | 73.375(±53.375) | 14.125(±164.125) | ∖ |
Random Policy | −86.000(±164.000) | −129.375(±144.375) | −94.875(±186.875) | −121.875(±196.875) |
UCT-Model | 88.125(±8.125) | 66.750(±27.250) | 73.875(±22.125) | ∖ |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | −30.250(±119.750) | 4.125(±154.125) | −36.500(±119.500) | −94.625(±128.625) |
PLASTIC-Policy | 20.000(±170.000) | −54.750(±132.750) | −92.625(±180.625) | 7.625(±157.625) |
DE Learns Team | 76.750(±15.750) | 53.125(±203.125) | 40.500(±49.500) | 22.000(±172.000) |
DE Learns Environment | 11.625(±161.625) | −109.375(±162.375) | −58.250(±135.250) | −96.250(±172.250) |
PLASTIC-Model | 91.125(±5.875) | 27.500(±177.500) | 27.000(±177.000) | −96.250(±172.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 34.375(±61.375) | −13.500(±136.500) | 4.375(±154.375) | −30.000(±120.000) |
PLASTIC-Policy | 10.625(±160.625) | −98.625(±187.625) | −136.625(±93.625) | −67.625(±117.625) |
DE Learns Team | 73.000(±30.000) | 43.750(±193.750) | 72.000(±18.000) | 64.375(±46.375) |
DE Learns Environment | −60.750(±148.750) | −33.000(±117.000) | 16.000(±166.000) | −134.875(±105.875) |
PLASTIC-Model | 86.750(±12.750) | 68.750(±49.750) | 64.500(±36.500) | −134.875(±105.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | −20.750(±129.250) | −35.125(±125.125) | 4.500(±154.500) | −105.875(±160.875) |
PLASTIC-Policy | −9.375(±140.625) | −92.375(±139.375) | −9.000(±141.000) | −106.625(±186.625) |
DE Learns Team | 85.125(±8.875) | 72.375(±21.625) | 75.625(±45.625) | 71.875(±71.875) |
DE Learns Environment | −69.375(±131.375) | −23.375(±126.625) | −43.750(±139.750) | −33.500(±121.500) |
PLASTIC-Model | 88.250(±21.250) | 68.125(±72.125) | 74.750(±62.750) | −33.500(±121.500) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 2.500(±152.500) | −46.375(±140.375) | −12.000(±138.000) | −34.875(±115.125) |
PLASTIC-Policy | −56.125(±131.125) | −32.250(±129.250) | −47.750(±141.750) | −88.125(±142.125) |
DE Learns Team | 80.500(±19.500) | 50.000(±200.000) | 35.750(±66.750) | 71.125(±94.125) |
DE Learns Environment | ∖ | ∖ | ∖ | ∖ |
PLASTIC-Model | ∖ | ∖ | ∖ | ∖ |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | ∖ |
Random Policy | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) |
UCT-Model | 97.750(±0.750) | 98.000(±0.000) | 97.750(±1.750) | 97.875(±0.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 97.875(±0.875) | 98.000(±0.000) | 97.875(±0.875) | 97.750(±1.750) |
DETEAMSK-Boltzmann policy | 97.875(±0.875) | 98.000(±0.000) | 97.625(±1.625) | 98.000(±0.000) |
DETEAMSK-Epsilon greedy policy | 96.875(±1.125) | 95.250(±5.250) | 96.875(±1.875) | 95.625(±3.625) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 97.750(±0.750) | 97.750(±0.750) | 97.625(±0.625) | 97.750(±0.750) |
DE Learns Environment | 98.000(±0.000) | 97.625(±1.625) | 97.750(±1.750) | 98.000(±0.000) |
PLASTIC-Model | 97.875(±0.875) | 98.000(±0.000) | 97.750(±0.750) | 97.750(±1.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 98.000(±0.000) | 97.875(±0.875) | 98.000(±0.000) | 98.000(±0.000) |
DETEAMSK-Boltzmann policy | 97.875(±0.875) | 98.000(±0.000) | 98.000(±0.000) | 97.625(±1.625) |
DETEAMSK-Epsilon greedy policy | 95.750(±2.750) | 97.125(±2.125) | 95.000(±13.000) | 84.000(±85.000) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 97.750(±0.750) | 97.625(±1.625) | 97.625(±1.625) |
DE Learns Environment | 97.750(±0.750) | 97.875(±0.875) | 97.750(±0.750) | 98.000(±0.000) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.875) | 98.000(±0.000) | 97.625(±1.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 97.875(±0.875) | 97.375(±2.375) | 97.750(±0.750) | 98.000(±0.000) |
DETEAMSK-Boltzmann policy | 98.000(±0.000) | 97.750(±0.750) | 97.625(±1.625) | 98.000(±0.000) |
DETEAMSK-Epsilon greedy policy | 96.125(±4.125) | 95.875(±2.875) | 95.500(±3.500) | 96.000(±4.000) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 97.875(±0.875) | 97.750(±0.750) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Environment | 97.875(±0.875) | 97.750(±0.750) | 97.500(±1.500) | 97.750(±1.750) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.875) | 97.875(±0.875) | 98.000(±0.000) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 98.000(±0.000) | 97.875(±0.000) | 97.875(±0.875) | 97.625(±1.625) |
DETEAMSK-Boltzmann policy | 98.000(±0.000) | 97.875(±0.000) | 97.750(±1.750) | 97.750(±1.750) |
DETEAMSK-Epsilon greedy policy | 95.875(±3.875) | 96.375(±3.875) | 97.700(±3.000) | 95.875(±9.875) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 97.875(±0.000) | 97.375(±2.375) | 97.625(±0.625) |
DE Learns Environment | 97.875(±0.875) | 97.750(±0.875) | 98.000(±0.000) | 97.625(±1.625) |
PLASTIC-Model | 98.000(±0.000) | 97.875(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
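The tables in this part of the appendix compare DETEAMSK under three action selection policies: greedy, Boltzmann, and epsilon-greedy. As a generic, non-authoritative illustration of how these rules differ, a short sketch follows; the temperature and epsilon values are arbitrary examples, not the paper's settings.

```python
import math
import random

# Generic sketches of the three exploration/exploitation policies compared in
# these tables; temperature and epsilon values are arbitrary illustrations.

def greedy(q_values):
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    # Softmax over Q-values: higher-valued actions are sampled more often.
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)),
                          weights=[p / total for p in prefs])[0]

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy(q_values)
```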
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 90.000(±2.000) | 83.000(±0.000) | 89.875(±1.875) | ∖ |
Random Policy | 78.250(±39.250) | 54.875(±49.875) | 67.250(±48.250) | 53.750(±51.750) |
UCT-Model | 89.125(±1.875) | 84.500(±5.500) | 89.375(±2.375) | 88.750(±2.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 78.750(±79.750) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DETEAMSK-Boltzmann policy | 90.125(±21.125) | 53.500(±54.500) | 90.750(±4.750) | 64.625(±65.625) |
DETEAMSK-Epsilon greedy policy | 91.250(±3.250) | 27.875(±58.125) | 71.000(±72.000) | 53.375(±54.375) |
PLASTIC-Policy | 81.000(±82.000) | 66.000(±67.000) | 92.250(±3.250) | 56.000(±57.000) |
DE Learns Team | 92.375(±6.375) | 74.375(±75.375) | 91.625(±3.625) | 75.500(±76.500) |
DE Learns Environment | 79.000(±80.000) | 73.125(±74.125) | 88.750(±8.750) | 80.875(±81.875) |
PLASTIC-Model | 91.750(±3.750) | 59.125(±60.125) | 91.125(±5.125) | 78.750(±79.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 90.125(±1.875) | 86.625(±6.625) | 76.875(±77.875) | 77.500(±78.500) |
DETEAMSK-Boltzmann policy | 90.125(±2.125) | 73.625(±57.625) | 87.875(±8.875) | 90.250(±7.250) |
DETEAMSK-Epsilon greedy policy | 73.625(±74.625) | 62.875(±63.875) | 71.875(±72.875) | 68.750(±57.750) |
PLASTIC-Policy | 87.000(±11.000) | 65.500(±66.500) | 47.500(±48.500) | 71.750(±72.750) |
DE Learns Team | 90.250(±4.250) | 86.625(±6.625) | 76.875(±77.875) | 77.500(±78.500) |
DE Learns Environment | 90.500(±3.500) | 86.875(±10.875) | 87.375(±17.375) | 85.750(±17.750) |
PLASTIC-Model | 91.875(±5.875) | 87.375(±7.375) | 91.125(±6.125) | 66.250(±67.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 64.625(±65.625) | 62.125(±63.125) | 88.750(±19.750) | 72.375(±73.375) |
DETEAMSK-Boltzmann policy | 94.000(±2.250) | 62.875(±63.875) | 92.250(±2.250) | 73.875(±74.875) |
DETEAMSK-Epsilon greedy policy | 70.375(±56.375) | 53.375(±54.375) | 72.250(±73.250) | 47.000(±48.000) |
PLASTIC-Policy | 63.000(±64.000) | 65.000(±66.000) | 86.750(±8.750) | 79.125(±80.125) |
DE Learns Team | 92.750(±2.750) | 62.125(±63.125) | 88.750(±19.750) | 72.375(±73.375) |
DE Learns Environment | 90.500(±4.500) | 76.750(±77.750) | 92.250(±3.250) | 77.750(±78.750) |
PLASTIC-Model | 91.375(±4.375) | 75.250(±76.250) | 91.750(±8.750) | 78.875(±79.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 58.625(±59.625) | 76.250(±77.250) | 81.250(±52.250) | 67.750(±68.750) |
DETEAMSK-Boltzmann policy | 94.000(±7.750) | 56.375(±7.750) | 90.000(±11.000) | 87.375(±10.375) |
DETEAMSK-Epsilon greedy policy | 86.500(±12.500) | 48.625(±12.500) | 83.000(±33.000) | 78.375(±41.375) |
PLASTIC-Policy | 88.500(±8.500) | 87.750(±8.750) | 78.875(±79.875) | 78.375(±79.375) |
DE Learns Team | 87.750(±4.750) | 76.250(±77.250) | 81.250(±52.250) | 67.750(±68.750) |
DE Learns Environment | 87.625(±17.625) | 87.750(±5.750) | 89.375(±4.625) | 78.250(±79.250) |
PLASTIC-Model | 91.750(±4.750) | 74.500(±75.500) | 89.000(±26.000) | 77.250(±78.250) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 90.000(±2.000) | 83.000(±0.000) | 89.875(±1.875) | ∖ |
Random Policy | 78.250(±39.250) | 54.875(±49.875) | 67.250(±48.250) | 53.750(±51.750) |
UCT-Model | 89.125(±1.875) | 84.500(±5.500) | 89.375(±2.375) | 88.750(±2.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 90.375(±2.375) | 87.125(±4.125) | 90.125(±2.125) | 88.500(±3.500) |
DETEAMSK-Boltzmann policy | 88.750(±2.250) | 57.875(±58.875) | 80.250(±24.250) | 87.000(±18.000) |
DETEAMSK-Epsilon greedy policy | 88.750(±1.250) | 79.625(±18.625) | 89.750(±1.750) | 86.625(±4.375) |
PLASTIC-Policy | 44.375(±46.625) | 31.250(±57.750) | −1.000(±0.000) | 55.000(±56.000) |
DE Learns Team | 89.875(±1.875) | 78.875(±14.875) | 89.250(±1.750) | 88.250(±2.750) |
DE Learns Environment | 89.625(±1.625) | 84.750(±5.250) | 89.375(±2.375) | 89.000(±2.000) |
PLASTIC-Model | 90.375(±2.375) | 72.750(±73.750) | 89.000(±2.000) | 88.125(±2.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 88.125(±7.125) | 84.750(±5.250) | 79.000(±80.000) | 86.750(±3.250) |
DETEAMSK-Boltzmann policy | 83.625(±18.625) | 74.750(±17.750) | 77.625(±75.625) | 77.375(±22.375) |
DETEAMSK-Epsilon greedy policy | 89.375(±5.375) | 84.875(±5.125) | 88.500(±3.500) | 87.875(±4.875) |
PLASTIC-Policy | 78.250(±79.250) | 32.000(±57.000) | 44.875(±46.125) | 67.125(±68.125) |
DE Learns Team | 88.875(±2.125) | 83.750(±15.750) | 89.125(±9.125) | 88.750(±2.250) |
DE Learns Environment | 89.000(±2.000) | 85.625(±4.375) | 90.125(±2.125) | 87.875(±4.875) |
PLASTIC-Model | 90.125(±2.125) | 81.875(±21.875) | 89.125(±1.875) | 87.875(±6.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 90.000(±2.000) | 83.625(±6.625) | 89.750(±1.750) | 88.750(±5.750) |
DETEAMSK-Boltzmann policy | 66.375(±67.375) | 72.250(±58.250) | 75.375(±76.375) | 84.625(±18.625) |
DETEAMSK-Epsilon greedy policy | 86.750(±14.750) | 79.500(±16.500) | 88.250(±11.250) | 87.250(±4.250) |
PLASTIC-Policy | 89.625(±2.625) | 53.375(±54.375) | 55.875(±56.875) | 76.625(±77.625) |
DE Learns Team | 89.125(±1.875) | 81.000(±12.000) | 89.750(±1.750) | 89.125(±1.875) |
DE Learns Environment | 89.500(±1.500) | 83.875(±6.125) | 88.875(±9.875) | 90.000(±2.000) |
PLASTIC-Model | 88.875(±2.125) | 84.750(±5.250) | 89.500(±1.500) | 87.375(±6.375) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK-Greedy policy | 88.375(±5.375) | 83.500(±6.500) | 88.250(±12.250) | 89.125(±3.125) |
DETEAMSK-Boltzmann policy | 77.250(±78.250) | 65.125(±78.250) | 68.250(±69.250) | 77.750(±40.750) |
DETEAMSK-Epsilon greedy policy | 88.375(±2.625) | 56.875(±2.625) | 86.875(±16.875) | 83.125(±46.125) |
PLASTIC-Policy | 78.750(±79.750) | 74.500(±75.500) | 78.000(±79.000) | 76.750(±77.750) |
DE Learns Team | 88.625(±2.375) | 82.500(±10.500) | 89.375(±1.625) | 89.500(±1.500) |
DE Learns Environment | 89.000(±2.000) | 84.750(±5.250) | 89.250(±2.250) | 88.625(±2.375) |
PLASTIC-Model | 90.000(±2.000) | 82.125(±7.875) | 89.000(±4.000) | 83.625(±36.625) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | ∖ |
Random Policy | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) | 83.000(±39.000) |
UCT-Model | 97.750(±0.750) | 98.000(±0.000) | 97.750(±1.750) | 97.875(±0.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 97.750(±0.750) | 97.875(±0.875) | 97.875(±0.875) | 97.750(±0.750) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 98.000(±0.000) | 97.750(±0.750) | 97.875(±0.875) |
DE Learns Environment | 97.875(±0.875) | 97.750(±1.750) | 97.625(±1.625) | 97.875(±0.875) |
PLASTIC-Model | 98.000(±0.000) | 98.000(±0.000) | 97.500(±1.500) | 98.000(±0.000) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 97.750(±0.750) | 98.000(±0.000) | 97.875(±0.875) | 97.875(±0.875) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 98.000(±0.000) | 97.625(±1.625) | 98.000(±0.000) |
DE Learns Environment | 98.000(±0.000) | 98.000(±0.000) | 97.625(±1.625) | 98.000(±0.000) |
PLASTIC-Model | 97.875(±0.875) | 98.000(±0.000) | 97.750(±0.750) | 97.750(±0.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 97.500(±2.500) | 97.875(±0.875) | 97.875(±0.875) | 98.000(±0.000) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Team | 98.000(±0.000) | 97.750(±0.750) | 97.375(±1.375) | 97.875(±0.875) |
DE Learns Environment | 98.000(±0.000) | 97.750(±0.750) | 97.625(±1.625) | 97.875(±0.875) |
PLASTIC-Model | 97.875(±0.875) | 98.000(±0.000) | 98.000(±0.000) | 97.625(±1.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 98.000(±0.750) | 97.500(±0.750) | 98.000(±0.000) | 97.750(±0.750) |
PLASTIC-Policy | 98.000(±0.000) | 98.000(±0.000) | 97.750(±1.750) | 98.000(±0.000) |
DE Learns Team | 97.750(±0.750) | 97.875(±0.750) | 98.000(±0.000) | 97.750(±0.750) |
DE Learns Environment | 97.500(±1.500) | 97.875(±1.500) | 97.375(±1.375) | 97.875(±0.875) |
PLASTIC-Model | 97.875(±0.875) | 98.000(±0.875) | 97.875(±0.875) | 97.875(±0.875) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 93.000(±1.000) | 87.375(±2.375) | 92.250(±2.250) | ∖ |
Random Policy | 58.125(±24.875) | 51.250(±40.250) | 52.375(±38.625) | 45.250(±40.250) |
UCT-Model | 82.125(±63.125) | 87.625(±2.625) | 90.750(±8.750) | 90.250(±5.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 44.375(±49.625) | 19.000(±71.000) | 58.000(±59.000) | 49.250(±50.250) |
PLASTIC-Policy | 81.500(±82.500) | 55.250(±56.250) | 81.125(±82.125) | 66.500(±67.500) |
DE Learns Team | 80.500(±81.500) | 72.125(±73.125) | 91.375(±7.375) | 64.875(±65.875) |
DE Learns Environment | 44.875(±49.125) | 52.250(±53.250) | 44.000(±50.000) | 31.750(±59.250) |
PLASTIC-Model | 93.500(±1.500) | 72.000(±73.000) | 88.500(±23.500) | 85.125(±26.125) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 22.375(±70.625) | 43.250(±44.750) | 79.125(±80.125) | 43.750(±48.250) |
PLASTIC-Policy | 92.625(±1.375) | 77.250(±78.250) | 79.750(±80.750) | 91.250(±6.250) |
DE Learns Team | 91.000(±6.000) | 76.000(±77.000) | 89.625(±3.625) | 98.000(±0.000) |
DE Learns Environment | 45.750(±48.250) | 12.125(±60.875) | 42.375(±49.625) | 36.625(±55.375) |
PLASTIC-Model | 89.625(±4.625) | 84.750(±4.750) | 91.375(±3.375) | 55.625(±56.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 45.625(±48.375) | 21.875(±69.125) | 65.125(±66.125) | 54.250(±55.250) |
PLASTIC-Policy | 92.875(±1.125) | 76.875(±77.875) | 92.875(±1.125) | 78.500(±79.500) |
DE Learns Team | 92.000(±8.000) | 64.500(±65.500) | 92.500(±2.500) | 79.125(±80.125) |
DE Learns Environment | 28.750(±63.250) | 41.625(±46.375) | 45.125(±48.875) | 62.875(±63.875) |
PLASTIC-Model | 92.750(±1.750) | 86.875(±5.875) | 92.750(±1.250) | 76.375(±77.375) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 63.875(±64.875) | 50.000(±64.875) | 52.625(±53.625) | 34.125(±59.875) |
PLASTIC-Policy | 92.625(±1.375) | 76.250(±1.375) | 79.625(±80.625) | 79.875(±80.875) |
DE Learns Team | 90.750(±5.750) | 76.750(±5.750) | 80.125(±81.125) | 74.500(±75.500) |
DE Learns Environment | 55.000(±56.000) | 50.875(±56.000) | 46.250(±47.750) | 62.625(±63.625) |
PLASTIC-Model | 89.125(±9.125) | 76.625(±9.125) | 91.000(±7.000) | 77.500(±78.500) |
Baseline | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
Original Teammate | 90.000(±2.000) | 83.000(±0.000) | 89.875(±1.875) | ∖ |
Random Policy | 78.250(±39.250) | 54.875(±49.875) | 67.250(±48.250) | 53.750(±51.750) |
UCT-Model | 89.125(±1.875) | 84.500(±5.500) | 89.375(±2.375) | 88.750(±2.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 88.625(±11.625) | 84.750(±20.750) | 78.000(±79.000) | 88.375(±5.375) |
PLASTIC-Policy | 90.875(±0.875) | 88.125(±5.125) | 90.500(±0.500) | 79.000(±80.000) |
DE Learns Team | 90.000(±6.000) | 84.250(±6.250) | 84.750(±29.750) | 87.125(±11.125) |
DE Learns Environment | 77.750(±78.750) | 74.250(±55.250) | 83.500(±30.500) | 89.000(±6.000) |
PLASTIC-Model | 88.375(±2.625) | 73.250(±68.250) | 90.000(±2.000) | 88.125(±4.125) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 78.000(±79.000) | 58.875(±59.875) | 74.625(±75.625) | 73.000(±74.000) |
PLASTIC-Policy | 90.500(±0.500) | 86.625(±4.375) | 78.750(±79.750) | 90.375(±2.375) |
DE Learns Team | 85.500(±9.500) | 76.750(±27.500) | 89.875(±2.875) | 83.000(±16.000) |
DE Learns Environment | 87.000(±11.000) | 83.000(±17.000) | 90.000(±5.000) | 84.125(±22.125) |
PLASTIC-Model | 89.125(±2.625) | 83.875(±6.125) | 78.375(±66.375) | 89.875(±1.875) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 77.250(±78.250) | 82.250(±9.250) | 89.875(±3.875) | 87.875(±6.875) |
PLASTIC-Policy | 89.625(±1.625) | 75.625(±76.625) | 90.375(±0.625) | 88.750(±3.750) |
DE Learns Team | 88.625(±2.375) | 65.750(±66.750) | 89.500(±3.500) | 88.375(±2.625) |
DE Learns Environment | 87.000(±10.000) | 86.000(±7.000) | 77.000(±78.000) | 82.375(±16.375) |
PLASTIC-Model | 88.375(±88.375) | 84.250(±5.750) | 89.625(±3.625) | 88.250(±7.250) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 87.750(±13.750) | 77.625(±13.750) | 89.250(±3.250) | 88.625(±9.625) |
PLASTIC-Policy | 89.000(±3.000) | 87.875(±3.000) | 89.000(±4.000) | 90.125(±2.125) |
DE Learns Team | 89.875(±1.875) | 85.000(±1.875) | 88.875(±4.875) | 86.250(±13.250) |
DE Learns Environment | 83.625(±42.625) | 74.250(±42.625) | 90.625(±0.625) | 74.375(±75.375) |
PLASTIC-Model | 89.250(±1.750) | 79.750(±1.750) | 88.875(±2.125) | 86.625(±6.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
---|---|---|---|---|
DETEAMSK | 25.500(±175.500) | −49.125(±125.125) | 17.000(±167.000) | −8.750(±141.250) |
PLASTIC-Policy | 17.625(±167.625) | −63.625(±157.625) | −62.875(±121.875) | −69.000(±142.000) |
DE Learns Team | 80.000(±15.000) | 62.750(±98.750) | 51.875(±59.875) | 46.500(±45.500) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 49.625(±60.625) | −28.500(±121.500) | −3.750(±146.250) | −15.750(±134.250) |
PLASTIC-Policy | 35.750(±185.750) | −102.250(±153.250) | −130.250(±138.250) | −100.500(±192.500) |
DE Learns Team | 70.250(±27.750) | 64.375(±51.375) | 51.250(±47.250) | 69.625(±60.625) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 55.625(±34.625) | −63.000(±152.000) | −82.500(±174.500) | −47.750(±139.750) |
PLASTIC-Policy | −87.000(±138.000) | −67.375(±157.375) | −94.375(±167.375) | −93.250(±159.250) |
DE Learns Team | 86.750(±12.750) | 66.500(±51.500) | 67.375(±60.375) | 63.750(±59.750) |
After Learning | Eval on | Eval on | Eval on | Eval on |
DETEAMSK | 24.625(±174.625) | 24.375(±174.375) | −56.875(±125.875) | −6.625(±143.375) |
PLASTIC-Policy | −57.750(±142.750) | −69.125(±161.125) | −83.875(±88.875) | −94.250(±188.250) |
DE Learns Team | 79.500(±32.500) | 62.750(±30.750) | 57.375(±50.375) | 80.000(±17.000) |
References
- Bekmezci, I.; Sahingoz, O.K.; Temel, Ş. Flying ad-hoc networks (FANETs): A survey. Ad Hoc Netw. 2013, 11, 1254–1270. [Google Scholar] [CrossRef]
- Kamel, B.; Oussama, A. Cooperative Navigation and Autonomous Formation Flight for a Swarm of Unmanned Aerial Vehicle. In Proceedings of the 2021 5th International Conference on Vision, Image and Signal Processing (ICVISP), Kuala Lumpur, Malaysia, 18–20 December 2021; pp. 212–217. [Google Scholar]
- Shen, L.C.; Niu, Y.F.; Zhu, H.Y. Theories and methods of autonomous cooperative control for multiple UAVs. Sci.-Technol. Publ. 2013, 50–56. [Google Scholar]
- Zou, L.Y.; Zhang, M.Z.; Rong, M. Analysis of intelligent unmanned aircraft systems swarm concept and main development trend. Tactical Missile Technol. 2019, 5, 1–11. [Google Scholar]
- Hu, X. The Science of War: Understanding the Scientific Foundations and Thinking Methods of War; Science Press. 2018. Available online: https://fx.gfkd.chaoxing.com/detail_38502727e7500f26645e6677ee9b5d50f085fb46afe9a5921921b0a3ea25510134114c969f2eae5c338fa974ce0cee956c2520ce8449a34547b9173ffddc58c1fc5334fd739e739d0891d1a62223f3f3 (accessed on 1 December 2021).
- Zhang, B.C.; Liao, J.; Kuang, Y.; Zhang, M.; Zhou, S.L.; Kang, Y.H. Research status and development trend of the United States UAV swarm battlefield. Aero Weapon. 2020, 27, 7–12. [Google Scholar]
- Niu, Y.F.; Shen, L.C.; Li, J.; Wang, X. Key scientific problems in cooperation control of unmanned-manned aircraft systems. Sci. Sin. Inform. 2019, 49, 538–554. [Google Scholar] [CrossRef]
- Wang, Y.C.; Liu, J.G. Evaluation methods for the autonomy of unmanned systems. Chin. Sci. Bull. 2012, 57, 3409–3418. [Google Scholar] [CrossRef]
- Mirsky, R.; Carlucho, I.; Rahman, A.; Fosong, E.; Macke, W.; Sridharan, M.; Stone, P.; Albrecht, S.V. A survey of ad hoc teamwork research. In Proceedings of the European Conference on Multi-Agent Systems, Düsseldorf, Germany, 14–16 September 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 275–293. [Google Scholar]
- Bowling, M.; McCracken, P. Coordination and adaptation in impromptu teams. In Proceedings of the AAAI 2005, Pittsburgh, PA, USA, 9–13 July 2005; Volume 5, pp. 53–58. [Google Scholar]
- Rovatsos, M.; Wolf, M. Towards social complexity reduction in multiagent learning: The adhoc approach. In Proceedings of the 2002 AAAI Spring Symposium on Collaborative Learning Agents, Palo Alto, CA, USA, 25–27 March 2002; pp. 90–97. [Google Scholar]
- Stone, P.; Kaminka, G.; Kraus, S.; Rosenschein, J. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; Volume 24, pp. 1504–1509. [Google Scholar]
- Yuan, L.; Zhang, Z.; Li, L.; Guan, C.; Yu, Y. A survey of progress on cooperative multi-agent reinforcement learning in open environment. arXiv 2023, arXiv:2312.01058. [Google Scholar] [CrossRef]
- Barrett, S.; Rosenfeld, A.; Kraus, S.; Stone, P. Making friends on the fly: Cooperating with new teammates. Artif. Intell. 2017, 242, 132–171. [Google Scholar] [CrossRef]
- Barrett, S.; Stone, P. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
- Barrett, S.; Stone, P. An analysis framework for ad hoc teamwork tasks. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, Valencia, Spain, 4–8 June 2012; pp. 357–364. [Google Scholar]
- Barrett, S.; Stone, P.; Kraus, S. Empirical evaluation of ad hoc teamwork in the pursuit domain. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, Taipei, Taiwan, 2–6 May 2011; pp. 567–574. [Google Scholar]
- Barrett, S. Making Friends on the Fly: Advances in Ad Hoc Teamwork; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
- Melo, F.S.; Sardinha, A. Ad hoc teamwork by learning teammates’ task. Auton. Agents Multi-Agent Syst. 2016, 30, 175–219. [Google Scholar] [CrossRef]
- Gu, P.; Zhao, M.; Hao, J.; An, B. Online ad hoc teamwork under partial observability. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021. [Google Scholar]
- Dodampegama, H.; Sridharan, M. Knowledge-based reasoning and learning under partial observability in ad hoc teamwork. Theory Pract. Log. Program. 2023, 23, 696–714. [Google Scholar] [CrossRef]
- Fujimoto, T.; Chatterjee, S.; Ganguly, A. Ad hoc teamwork in the presence of adversaries. arXiv 2022, arXiv:2208.05071. [Google Scholar] [CrossRef]
- Santos, P.M.; Ribeiro, J.G.; Sardinha, A.; Melo, F.S. Ad hoc teamwork in the presence of non-stationary teammates. In Proceedings of the EPIA Conference on Artificial Intelligence, Virtual, 7–9 September 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 648–660. [Google Scholar]
- Ribeiro, J.G.; Rodrigues, G.; Sardinha, A.; Melo, F.S. TEAMSTER: Model-based reinforcement learning for ad hoc teamwork. Artif. Intell. 2023, 324, 104013. [Google Scholar] [CrossRef]
- Fosong, E.; Rahman, A.; Carlucho, I.; Albrecht, S.V. Few-shot teamwork. arXiv 2022, arXiv:2207.09300. [Google Scholar] [CrossRef]
- Chandrasekaran, M.; Eck, A.; Doshi, P.; Soh, L. Individual planning in open and typed agent systems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, New York, NY, USA, 25–29 June 2016; pp. 82–91. [Google Scholar]
- Ribeiro, J.G.; Faria, M.; Sardinha, A.; Melo, F.S. Helping people on the fly: Ad hoc teamwork for human-robot teams. In Proceedings of the Progress in Artificial Intelligence: 20th EPIA Conference on Artificial Intelligence, EPIA 2021, Virtual Event, 7–9 September 2021; Proceedings 20. Springer International Publishing: Cham, Switzerland, 2021; pp. 635–647. [Google Scholar]
- Ribeiro, J.G.; Martinho, C.; Sardinha, A.; Melo, F.S. Assisting unknown teammates in unknown tasks: Ad hoc teamwork under partial observability. arXiv 2022, arXiv:2201.03538. [Google Scholar] [CrossRef]
- Rahman, M.A.; Hopner, N.; Christianos, F.; Albrecht, S.V. Towards open ad hoc teamwork using graph-based policy learning. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 8776–8786. [Google Scholar]
- Rahman, A.; Carlucho, I.; Höpner, N.; Albrecht, S.V. A general learning framework for open ad hoc teamwork using graph-based policy learning. J. Mach. Learn. Res. 2023, 24, 1–74. [Google Scholar]
- Rahman, M.A. Advances in Open ad Hoc Teamwork and Teammate Generation. 2023. Available online: https://era.ed.ac.uk/handle/1842/40567 (accessed on 15 May 2023).
- Eck, A.; Shah, M.; Doshi, P.; Soh, L.K. Scalable decision-theoretic planning in open and typed multiagent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7127–7134. [Google Scholar]
- Jameson, S.; Franke, J.; Szczerba, R.; Stockdale, S. Collaborative autonomy for manned/unmanned teams. In Proceedings of the Annual Forum Proceedings-American Helicopter Society, Grapevine, TX, USA, 1–3 June 2005; American Helicopter Society, Inc.: Fairfax, VA, USA, 2005; Volume 61, p. 1673. [Google Scholar]
- Martinez, S. UAV cooperative decision and control: Challenges and practical approaches (shima, t. and rasmussen, s.; 2008) [bookshelf]. IEEE Control Syst. Mag. 2010, 30, 104–107. [Google Scholar]
- Zhu, S.; Wang, D.; Low, C.B. Cooperative control of multiple UAVs for moving source seeking. In Proceedings of the 2013 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 28–31 May 2013; pp. 193–202. [Google Scholar]
- Neves, A.; Sardinha, A. Learning to Cooperate with Completely Unknown Teammates. In Proceedings of the EPIA Conference on Artificial Intelligence, Lisbon, Portugal, 31 August–2 September 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 739–750. [Google Scholar]
- Sutton, R.S. Reinforcement Learning: An Introduction; A Bradford Book; 2018. Available online: https://www.cambridge.org/core/journals/robotica/article/abs/robot-learning-edited-by-jonathan-h-connell-and-sridhar-mahadevan-kluwer-boston-19931997-xii240-pp-isbn-0792393651-hardback-21800-guilders-12000-8995/737FD21CA908246DF17779E9C20B6DF6 (accessed on 13 July 2025).
- Kocsis, L.; Szepesvári, C. Bandit based monte-carlo planning. In Proceedings of the European Conference on Machine Learning, Berlin, Germany, 18–22 September 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 282–293. [Google Scholar]
- Browne, C.B.; Powley, E.; Whitehouse, D.; Lucas, S.M.; Cowling, P.I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; Colton, S. A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 2012, 4, 1–43. [Google Scholar] [CrossRef]
- Gray, R.C.; Zhu, J.; Ontañón, S. Beyond UCT: MAB Exploration Improvements for Monte Carlo Tree Search. In Proceedings of the 2023 IEEE Conference on Games (CoG), Boston, MA, USA, 21–24 August 2023; pp. 1–8. [Google Scholar]
- Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar]
- Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989. [Google Scholar]
- Kalman, B.L.; Kwasny, S.C. Why tanh: Choosing a sigmoidal function. In Proceedings of the [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, Baltimore, MD, USA, 7–11 June 1992; Volume 4, pp. 578–581. [Google Scholar]
- Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
- Shi, N.; Li, D. Rmsprop converges with proper hyperparameter. In Proceedings of the International Conference on Learning Representation, Online, 3–7 May 2021. [Google Scholar]
- Zou, F.; Shen, L.; Jie, Z.; Zhang, W.; Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11127–11135. [Google Scholar]
- Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Yourdshahi, E.S.; Pinder, T.; Dhawan, G.; Marcolino, L.S.; Angelov, P. Towards large scale ad-hoc teamwork. In Proceedings of the 2018 IEEE International Conference on Agents (ICA), Singapore, 28–31 July 2018; pp. 44–49. [Google Scholar]
- Csáji, B.C.; Monostori, L. Value function based reinforcement learning in changing Markovian environments. J. Mach. Learn. Res. 2008, 9, 1679–1709. [Google Scholar]
- Albrecht, S.V.; Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artif. Intell. 2018, 258, 66–95. [Google Scholar] [CrossRef]
- Deisenroth, M.P.; Neumann, G.; Peters, J. A survey on policy search for robotics. Found. Trends Robot. 2013, 2, 1–142. [Google Scholar]
- Guestrin, C.; Koller, D.; Parr, R.; Venkataraman, S. Efficient solution algorithms for factored MDPs. J. Artif. Intell. Res. 2003, 19, 399–468. [Google Scholar] [CrossRef]
- Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
- Szepesvári, C. Algorithms for Reinforcement Learning; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
- Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 2013, 47, 253–279. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; Springer: Berlin/Heidelberg, Germany, 2010; pp. 183–221. [Google Scholar]
- Kool, W.; Van Hoof, H.; Welling, M. Attention, learn to solve routing problems! arXiv 2018, arXiv:1803.08475. [Google Scholar]
- Melo, F.S.; Veloso, M. Learning of coordination: Exploiting sparse interactions in multiagent systems. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, Budapest, Hungary, 10–15 May 2009; pp. 773–780. [Google Scholar]
- Hu, Y.; Gao, Y.; An, B. Learning in Multi-agent Systems with Sparse Interactions by Knowledge Transfer and Game Abstraction. In Proceedings of the AAMAS 2015, Istanbul, Turkey, 4–8 May 2015; pp. 753–761. [Google Scholar]
- Melo, F.S.; Sardinha, A.; Belo, D.; Couto, M.; Faria, M.; Farias, A.; Gambôa, H.; Jesus, C.; Kinarullathil, M.; Lima, P.; et al. Project INSIDE: Towards autonomous semi-unstructured human–robot social interaction in autism therapy. Artif. Intell. Med. 2019, 96, 198–216. [Google Scholar] [CrossRef] [PubMed]
- Chen, S.; Andrejczuk, E.; Cao, Z.; Zhang, J. Aateam: Achieving the ad hoc teamwork by employing the attention mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7095–7102. [Google Scholar]
- Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Zhang, Y.; Zavlanos, M.M. Cooperative Multiagent Reinforcement Learning With Partial Observations. IEEE Trans. Autom. Control 2023, 69, 968–981. [Google Scholar] [CrossRef]
- Fernandez, R.; Asher, D.E.; Basak, A.; Sharma, P.K.; Zaroukian, E.G.; Hsu, C.D.; Dorothy, M.R.; Kroninger, C.M.; Frerichs, L.; Rogers, J.; et al. Multi-Agent Coordination for Strategic Maneuver with a Survey of Reinforcement Learning; tech. rep.; US Army Combat Capabilities Development Command, Army Research Laboratory: 2021. Available online: https://apps.dtic.mil/sti/html/trecms/AD1154872/ (accessed on 1 December 2021).
- Albrecht, S.V.; Ramamoorthy, S. Comparative evaluation of MAL algorithms in a diverse set of ad hoc team problems. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, Valencia, Spain, 4–8 June 2012; pp. 349–356. [Google Scholar]
- Arafat, M.Y.; Moh, S. A Q-learning-based topology-aware routing protocol for flying ad hoc networks. IEEE Internet Things J. 2021, 9, 1985–2000. [Google Scholar] [CrossRef]
- Yoo, J.; Jang, D.; Kim, H.J.; Johansson, K.H. Hybrid reinforcement learning control for a micro quadrotor flight. IEEE Control Syst. Lett. 2020, 5, 505–510. [Google Scholar] [CrossRef]