Article
Peer-Review Record

Multi-Agent Deep Reinforcement Learning for Collision-Free Posture Control of Multi-Manipulators in Shared Workspaces

Sensors 2025, 25(22), 6822; https://doi.org/10.3390/s25226822
by Hoyeon Lee, Chenglong Luo and Hoeryong Jung *
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 8 August 2025 / Revised: 5 November 2025 / Accepted: 6 November 2025 / Published: 7 November 2025
(This article belongs to the Section Intelligent Sensors)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper addresses the issue of morphological diversity in robots and proposes a framework called GMAIL for group adversarial imitation learning, aiming to train a universal policy that can adapt to different morphologies. However, there are still some issues that need to be addressed.

The specific revision suggestions are as follows:

  1. The CTDE architecture and MAPPO algorithm adopted in this paper are relatively mature frameworks in the multi-agent field. So, is the main contribution of this paper the successful application of existing algorithms to the specific scenario of multi-robot arm control, along with effective engineering implementation and optimization, or the proposal of new improvements at the algorithm or framework level?
  2. From the results in Tables 5 and 6, in certain scenarios (such as Scenario 4), the success rate of the PB method (84%) is even slightly higher than that of the LB method (83%). Although LB has a significant advantage in convergence time, a more in-depth explanation is needed for this phenomenon.
  3. The tasks in the experiment (reaching the target point, picking and placing) mainly verified the position control of the end effector. As the author acknowledged in the discussion section, the posture of the end effector was not taken into account. In many practical applications (such as assembly, grasping objects in specific postures), posture control is essential. Will the current framework be severely limited in its applicable scenarios without introducing posture control?
  4. Figure 5 presents the learning curve, which demonstrates the convergence of the algorithm. Are these curves the average results of multiple experiments? If so, can the shaded areas representing the standard deviations be added to show the stability of the training process?
  5. In the pick-and-place task described in Section 3.4, the efficiency of the multi-robot arm system (8-13 seconds) is significantly higher than that of the single-arm system (18-29 seconds). This conclusion is intuitive. Was this efficiency improvement solely due to the parallelization of the task, or did the MADRL policy learn some form of "division of labor and collaboration"?
  6. Some recent studies, especially on RL, should be discussed, such as "Adaptive critic design for safety-optimal FTC of unknown nonlinear systems with asymmetric constrained-input" and "Event-triggered H∞ control for unknown constrained nonlinear systems with application to robot arm", which may enhance the credibility and the impact of the manuscript.
  7. In Section 3.3.2, this paper compares the proposed "segment-based (LB)" method with the "keypoint-based (PB)" method. How are the number and positions of the key points in the PB method determined? Will the density of the key points affect the performance and computational efficiency of the PB method? (A minimal sketch of the segment-distance computation underlying the LB representation follows this list.)
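For reference, the following is a minimal sketch, not the authors' implementation, of the capsule-style proximity test that a line-segment (LB) representation implies: each link is reduced to an axis segment plus a radius, and a collision is flagged when the minimum segment-segment distance falls below the summed radii plus a safety margin (the 0.05 margin is an assumed placeholder, echoing the ϵ = 0.05 mentioned by Reviewer 2; all names are illustrative).

```python
import numpy as np

def min_segment_distance(p1, q1, p2, q2):
    """Minimum distance between 3D segments [p1, q1] and [p2, q2]."""
    d1, d2, r = q1 - p1, q2 - p2, p1 - p2
    a, e = d1 @ d1, d2 @ d2
    b, c, f = d1 @ d2, d1 @ r, d2 @ r
    denom = a * e - b * b
    # Closest-point parameters s, t clamped to [0, 1] (Ericson-style projection).
    s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > 1e-12 else 0.0
    t = np.clip((b * s + f) / e, 0.0, 1.0) if e > 1e-12 else 0.0
    if a > 1e-12:
        s = np.clip((b * t - c) / a, 0.0, 1.0)
    return np.linalg.norm((p1 + s * d1) - (p2 + t * d2))

def links_too_close(p1, q1, r1, p2, q2, r2, margin=0.05):
    """Capsule-vs-capsule check: links as axis segments plus radii."""
    return min_segment_distance(p1, q1, p2, q2) < r1 + r2 + margin
```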

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  • Add an introduction between sections and subsections
  • The quality of the images should be improved as they appear pixelated.
  • Ensure that all variables and parameters in the equations are described in the text, for example, m and n in Equation (3).
  • It is recommended to add updated references to 2024 and 2025.
  • It is not entirely clear how cases involving non-cylindrical geometries or tools attached to the end effector that are more complex than a gripper are handled.
  • The collision threshold and safety margin (ϵ=0.05) are set arbitrarily without a sensitivity analysis.
  • The training was carried out for 36,000 steps with 1,024 parallel environments. However, the criteria for choosing these parameters are not explained, nor is it explained whether stability was evaluated with fewer resources.
  • There is no description of the buffer size or of how variance was controlled in the advantage estimation (only GAE is mentioned; a minimal GAE sketch is shown after this list for reference).
  • The reward is composed of several terms, but the weighting coefficients are not detailed, nor is their empirical adjustment.
  • MAPPO is used, but without justifying the choice compared to alternatives (HATRPO, QMIX, MADDPG). Recent literature shows that MAPPO's performance is not always superior in dense coordination tasks.
  • There is no stability analysis: only average convergence is shown, but typical MARL problems, such as the non-stationarity of the joint policy or techniques to mitigate it, are not discussed.
  • Success and collision rates are used, but more informative metrics for RL, such as cumulative return per episode, reward distribution, or learning rate per agent, are not. This limits understanding of learned behavior.
  • The conclusion that MAPPO + line segments is superior may be true in this setting, but without further baselines, it is unclear whether the improvement is attributable to the algorithm, the representation, or the reward function.
  • There is no discussion about the real-time inference cost: even if training converges, it is not proven that the policy can be executed with low latency in real-world control.
  • It is not reported whether the trained policies maintain performance when varying initial conditions (random seeds, starting positions), which is essential in MARL to validate robustness.
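On the variance-control point above, here is a minimal, illustrative sketch of Generalized Advantage Estimation (GAE) as commonly implemented for PPO-family methods such as MAPPO; the gamma and lambda values are typical defaults, not the manuscript's settings, and the zero bootstrap at the horizon is a simplification.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # Assumes a zero value bootstrap at the horizon for brevity.
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * (1.0 - dones[t]) * next_value - values[t]
        last = delta + gamma * lam * (1.0 - dones[t]) * last
        adv[t] = last
    # Batch normalization of advantages is a common additional variance control.
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```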

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors, please find my comments, questions, and suggestions below.

There is a typo in the References section. Please remove items 32 and 33 from there.

In my opinion, you should clearly state the novelty of your proposed solution and its differences from existing approaches in the introduction and abstract sections. This will help readers better understand the objectives of your research.

In line 268 you wrote: “In our experiments, the threshold was set to δ_targ = 0.05 m”. Why did you choose this specific value? Could you please provide a brief explanation? This information might be helpful to some colleagues or researchers in the field.

In section 3.1, you describe the scenarios used for reinforcement learning. How closely do they relate to real-world technical tasks? What inspired you to choose this set of test cases? Could you provide any links or examples of similar tasks in the industry?

The success rate of the proposed method varies from 83% to 97% in different scenarios. What is the minimum acceptable value for this parameter in industrial control systems? Is there an area where your solution could be implemented now in its current form? I believe it would be valuable if you could discuss this matter briefly.

In general, I think your Article deserves publication in the Sensors Journal after revision.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

It was a great pleasure to review this manuscript, which proposes a Multi-Agent Deep Reinforcement Learning (MADRL) framework for real-time, collision-free posture control of multi-manipulator systems in a shared workspace. After reviewing the manuscript, I have identified several points that could further enhance the quality and clarity of the research:

  1. On page 3 (Figure 1): The purpose of the arrow from panel (a) to panel (b) is unclear. Clarify how the Line-segment Based representation process (a) conceptually flows into the larger Centralized Training and Decentralized Execution (CTDE) architecture (b).
  2. On page 5 (Figure 2): Enhance the visual representation of the process by implementing a clear and consistent use of dotted lines and directional arrows. Ensure that the workflow progression is intuitively mapped out, with each step seamlessly connecting to the next through well-defined visuals.
  3. On page 7 (2.5. Reward Function Design): Correct the typographical error in Equation (14) by replacing the inconsistent symbol, to maintain consistency with the accompanying text and other mathematical expressions.
  4. On page 8 (2.5. Reward Function Design): Refine Equation (16) to provide a more precise and formal representation of the distance between the gripper and the target. Consider using explicit mathematical notation that clearly defines this distance.
  5. On page 8 (2.5. Reward Function Design): To improve transparency and facilitate result reproducibility, explicitly specify the numerical values assigned to the weight parameters in the total reward function of Equation (20). Provide the specific values used during the experimental setup, which will help other researchers understand and potentially replicate the reward weighting strategy. (An illustrative sketch of such a weighted reward composition follows this list.)
  6. On page 10 (Figure 3): Add a symbol or clarification in the diagram to show how the “State, Action, Reward” and “Observation” signals are aggregated or combined before entering the Replay Buffer.
  7. On page 18 (3.4. Performance Comparison in Task Environment): The use of a period in the abbreviation within the main text is confusing, as it can be mistaken for the end of a sentence. Remove the period from the abbreviation in prose to avoid ambiguity.
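To make the request in point 5 concrete, here is a hedged sketch of a weighted reward composition of the kind Equation (20) appears to describe; the component names and coefficient values below are placeholders for illustration, not the manuscript's actual values.

```python
def total_reward(terms: dict, weights: dict) -> float:
    """Weighted sum of shaped reward components."""
    return sum(weights[k] * v for k, v in terms.items())

# Hypothetical per-step components for one manipulator agent.
step_reward = total_reward(
    terms={"target_distance": -0.12, "collision": -1.0, "smoothness": -0.03},
    weights={"target_distance": 1.0, "collision": 2.0, "smoothness": 0.1},
)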

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors
  1. The paper lacks real-world hardware validation, and all experiments are conducted in simulation (Isaac Sim), raising concerns about the Sim-to-Real gap.
  2. The orientation of the end-effector is not considered in the state representation, limiting applicability to tasks requiring precise orientation control such as assembly.
  3. The comparison baseline is narrow, restricted to reinforcement learning approaches (IPPO, PB representation) without benchmarking against classical planners (RRT, PRM, CHOMP, TrajOpt) or recent hybrid approaches.
  4. The scalability of the proposed method is insufficiently validated; experiments are limited to three manipulators, and performance with larger numbers remains unknown.
  5. Computational resource requirements are not adequately addressed; training depends on high-performance GPUs, and the feasibility in constrained industrial environments is unclear.
  6. The reward design is hand-crafted and task-specific, raising concerns about generalization to different industrial tasks without redesign.
  7. The success rates, while relatively high, drop notably under dense environments (83% in Scenario 4), which may not be acceptable in safety-critical industrial applications.
  8. The paper does not quantify failure cases in detail (e.g., types of collisions, causes of task failures), leaving gaps in understanding policy robustness.
  9. The proposed line-segment state representation, while efficient, may oversimplify link geometry and underestimate collision risks in complex shapes or irregular manipulators.
  10. The experimental tasks (pick-and-place with cubes) are relatively simple and do not convincingly demonstrate applicability to complex industrial operations.
  11. The convergence analysis lacks statistical rigor; only mean values are reported, without confidence intervals or significance testing across runs. (A brief sketch of such reporting follows this list.)
  12. The discussion section acknowledges limitations but does not provide concrete plans for addressing Sim-to-Real transfer, orientation control, or benchmarking in future work.
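As an illustration of the statistical reporting requested in point 11, the following sketch aggregates per-seed success rates into a mean with a normal-approximation 95% confidence interval; the numbers are invented for demonstration only.

```python
import numpy as np

def success_ci(per_seed_rates, z=1.96):
    """Mean success rate with a normal-approximation confidence interval."""
    rates = np.asarray(per_seed_rates, dtype=float)
    mean = rates.mean()
    half = z * rates.std(ddof=1) / np.sqrt(len(rates))
    return mean, (mean - half, mean + half)

# Hypothetical per-seed success rates from five independent training runs.
mean, (lo, hi) = success_ci([0.83, 0.86, 0.81, 0.84, 0.85])
print(f"success rate: {mean:.2%} (95% CI {lo:.2%}-{hi:.2%})")
```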

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This paper addresses the challenge of collision-free posture control for multiple robotic arms operating within a shared workspace by proposing a multi-agent deep reinforcement learning (DRL) framework based on a segmented link representation and a centralized training-decentralized execution (CTDE) architecture. While the paper presents a relatively complete framework, its main issue lies in the lack of novelty.

Meanwhile there remain numerous areas requiring improvement in theoretical derivation, experimental validation and analysis, and the overall scheme.

The specific comments are as follows: 

  1. Although the proposed method achieves a success rate exceeding 89% (reaching 97% in some scenarios) for 2-robotic-arm systems, the success rate drops to 83% when scaled to 3 robotic arms. Industrial scenarios typically require a success rate of ≥90%, and the collaborative performance with 4 or more robotic arms has not been tested.
  2. As the number of robotic arms increases, collision risks rise nonlinearly. The existing collision detection mechanisms and reward functions struggle to efficiently coordinate the dynamic interactions among multiple agents, leading to reduced policy stability.
  3. Comparisons were only made with basic multi-agent reinforcement learning (MARL) algorithms (e.g., IPPO), traditional state representation methods, and classical motion planning approaches. No comparisons were conducted with mainstream MARL algorithms that also adopt the CTDE architecture. This failure to demonstrate the advantages of the proposed MAPPO-based framework among CTDE-type algorithms, coupled with the lack of engagement with cutting-edge MARL research, limits the persuasiveness of the conclusions.
  4. The 4 experimental scenarios designed in the paper (parallel forward movement, face-to-face interaction, cross-movement of two arms, and cross-movement of three arms) are all static and structured environments: target positions are fixed, there are no dynamic obstacles, and the initial postures of the robotic arms are only randomly sampled without simulating failures or temporary withdrawal. Such scenarios fail to replicate the dynamic nature of real-world industrial environments.
  5. Comparisons with alternative state representation methods are insufficient: the paper only compares its approach with the "Key Point Method (PB)" and does not evaluate other efficient state representation techniques. This makes it impossible to verify that the "segment representation method" is the optimal solution balancing efficiency and accuracy.
  6. The segment representation assumes that robotic arm links are cylindrical. However, real industrial robotic arms often include non-cylindrical components such as square links, multi-fingered grippers, and specialized tools. In such cases, the segment representation loses critical geometric information, leading to misjudgments in collision detection. The paper fails to propose a solution adaptable to non-cylindrical structures, which restricts the hardware compatibility of the method.
  7. Lack of collision emergency handling mechanisms: In industrial settings, robotic arms must stop immediately and trigger an alarm upon collision. However, the paper’s experiments and training do not simulate this safety mechanism; when a collision is detected, only a penalty of -2.0 is imposed, with no “emergency braking” logic designed. This oversight could lead to equipment damage or human casualties in practical applications. (A minimal sketch of such a runtime safety guard follows this list.)
  8. The paper states that the "learning curve is based on the average reward from 100 independent runs" but provides no details regarding: (a) the range of random seed values used; (b) whether constraints were applied to the random sampling ranges for the initial postures of the robotic arms and target positions. If the random sampling range is too narrow, experimental results may be biased toward specific scenarios, lacking generalizability.
  9. In industrial applications, robotic arms often carry workpieces of varying weights. Load variations directly affect joint torque, motion accuracy, and response speed. However, in the paper’s experiments, the robotic arms were consistently operated under "no-load" or "fixed-load" conditions.
  10. Energy consumption is a critical economic metric in industrial applications, yet the paper did not measure the energy consumption (e.g., motor power consumption) of the robotic arms when executing the proposed policy. For instance, while the segment representation method enables fast convergence, it may result in more complex motion trajectories for the robotic arms and thus higher energy consumption. The paper fails to conduct a trade-off analysis between these factors.
  11. The overview is incomplete. In fact, many recent achievements in the field of reinforcement learning have not been discussed, such as "Self-triggered approximate optimal neuro-control for nonlinear systems through adaptive dynamic programming", "Adaptive critic design for safety-optimal FTC of unknown nonlinear systems with asymmetric constrained-input", and "Noise suppression zeroing neural network for online solving the time-varying inverse kinematics problem of four-wheel mobile manipulators with external disturbances", which may enhance the credibility and provide a broader perspective of the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Please provide evidence of real-life implementation.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

The scalability of the proposed method is limited. Performance drops significantly when the number of manipulators increases.

End-effector orientation control is not considered, which restricts applicability to real industrial tasks.

The method relies solely on simulation. No real-world validation is provided.

The comparison baseline is narrow. Recent multi-agent reinforcement learning methods are missing.

Success rates in complex scenarios remain insufficient for practical deployment.

Safety, robustness, and industrial constraints are not adequately addressed.

The paper would benefit from a broader review of related works and a stronger theoretical foundation. We suggest that the authors read “A novel muscle-computer interface for hand gesture recognition using depth vision” and “A Review of AIoT-based Human Activity Recognition: From Application to Technique”. These works may provide useful perspectives on human–robot interaction, sensing, and AI integration. Integrating such insights may help enhance the quality and impact of the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

While the completeness of the model architecture has been enhanced and details of experimental design have been refined through revisions, core limitations remain unaddressed. Critical issues that impact the work’s academic rigor and industrial applicability persist, which renders the paper unqualified for immediate acceptance. A major revision is therefore recommended to resolve these deficiencies.

  1. Improper citation: references should be cited at ‘Inspired by its success in other…’. The use of professional terminology is also not standardized, e.g., CTDE. The captions of Figures 1a and 1b should be placed below the figure. Figure 2 clearly has four subfigures, which should be labeled separately as in Figure 1. In Figure 3, the subscript i in Loss_i, actor_i, o_i, and a_i should be written as n, corresponding to Env. n below, where n is the number of agents.
  2. The success rate (SR) in the four-manipulator scenario is only 80%, significantly below the >90% reliability threshold required for industrial deployment. Notably, the manuscript fails to conduct a root-cause analysis of failures—for instance, whether they stem from inaccuracies in inter-link distance calculation, delays in multi-agent policy coordination, or limitations in the state representation itself. Meanwhile, the LS-based representation relies on the unvalidated assumption that manipulator links are cylindrical. No experiments or discussions are provided regarding its applicability to non-cylindrical structures (e.g., multi-fingered grippers or specialized end-effectors), severely restricting the method’s generalizability to real-world manipulator designs.
  3. When comparing against Centralized Training and Decentralized Execution (CTDE) methods, the manuscript excludes mainstream multi-agent reinforcement learning algorithms (e.g., QMIX, VDN) that are widely adopted in similar multi-robot coordination tasks. Furthermore, it does not elucidate the fundamental mechanisms underlying MAPPO’s superiority, such as how the centralized critic efficiently leverages global state information to outperform alternative CTDE variants (a minimal illustration of the CTDE pattern follows this list). Moreover, key metrics relevant to real-world industrial use (e.g., energy consumption during task execution, joint torque loads, or trajectory smoothness) are not measured or reported. Without these data, the manuscript cannot demonstrate the method’s practical viability beyond simulated performance.
  4. The limitations section only touches on minor issues (e.g., scalability to more manipulators, cylindrical link assumptions) but ignores critical challenges, such as the exponential increase in training computational overhead with growing manipulator numbers, or the method’s robustness in dynamic environments with moving obstacles.
  5. Key hyperparameters of the MAPPO algorithm (e.g., entropy coefficient, reward component weights) are presented without justification for their selection. Sensitivity analyses to demonstrate how variations in these parameters impact performance are also absent, weakening the reproducibility of the work.
  6. Reinforcement learning is a rapidly developing field, and this technology is also used in this manuscript. Therefore, discussing recent related achievements such as "Reinforcement learning-based secure tracking control for nonlinear interconnected systems: An event-triggered solution approach", "Observer based fault tolerant control design for saturated nonlinear systems with full state constraints via a novel event-triggered mechanism", and "A lightweight network enhanced by attention-guided cross-scale interaction for underwater object detection" may greatly improve the persuasiveness and quality of the paper.
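For context on the CTDE mechanism raised in point 3, here is a minimal, illustrative PyTorch sketch of the pattern: each agent's actor conditions only on its local observation (decentralized execution), while a single critic conditions on the concatenated global state during training. The layer sizes are arbitrary, and this is not the manuscript's network.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Per-agent policy network: sees only the agent's local observation."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

class CentralCritic(nn.Module):
    """Shared value network: sees the global state, used only during training."""
    def __init__(self, global_state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(global_state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, global_state):
        return self.net(global_state)
```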
Comments on the Quality of English Language

For example, the punctuation at the end of this sentence ‘In our experiments...’ is missing.

The statement below Formula (19) should be written in the top cell. Formula (20) should end with a period.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has been improved, it can be considered for publication.

Author Response

Comment 1: The manuscript has been improved, it can be considered for publication.

Response 1: We thank the reviewer for the positive comments and for considering our revised manuscript suitable for publication.

Reviewer 5 Report

Comments and Suggestions for Authors

The current version is ready for publication.

Author Response

Comment 1: The current version is ready for publication.

Response 1: We thank the reviewer for the positive comments and for considering our revised manuscript suitable for publication.
