Kuramoto Object-Centric Reinforcement Learning for Robotic Manipulation Tasks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes KORL (Kuramoto Object-Centric Reinforcement Learning), an object-centric model-based reinforcement learning method. The method first introduces the Kuramoto Slot Attention for Video (KSAVi) model, which combines Slot Attention with Kuramoto oscillatory neurons to extract robust object representations from visual inputs. Subsequently, based on a pre-trained KSAVi encoder and a graph neural network (GNN) world model, the KORL algorithm is constructed to leverage structured object representations for planning. Experiments on visually diverse robotic manipulation tasks show that KORL achieves more robust performance than existing object-centric model-free and model-based reinforcement learning baselines, and exhibits zero-shot generalization under significant changes in scene appearance, whereas monolithic model-based methods fail. Furthermore, replacing KSAVi with SlotContrast, a state-of-the-art object-centric video model, leads to performance degradation, further validating the suitability of KSAVi for reinforcement learning tasks. Here are detailed comments:
- Section 3.4: It is recommended to add numbering to the formulas in "Kuramoto Slot Attention for Video" and to check for any extraneous comma symbols at the end of formulas throughout the paper.
- Section 3.6.1: The authors replaced the encoder of OCRL with KSAVi, citing that SLATE could not be successfully trained. It is suggested to clarify which specific environments SLATE failed in, whether hyperparameter tuning was attempted, and to discuss whether the modified OCRL can still be regarded as the original method.
- Section 3.8 and Table 3: Success rates and returns from different environments are directly normalized and then averaged. Given the different dimensions and distribution characteristics of the two metrics, it is recommended to explain the rationale behind this aggregation approach.
- Section 3.8: Exponential smoothing (parameter 0.3) is used in Figure 5. It is suggested to explain the basis for selecting this smoothing parameter, or to also present the raw curves to avoid masking early-stage fluctuations.
- It is suggested to review recent reinforcement learning based control methods, discuss the model based and data based methods, on policy or off policy methods, and if possible it is suggested to compare the proposed control with typical methods.
- References: The reference formats in the paper are inconsistent (e.g., the way arXiv IDs are presented differs across entries). It is recommended to unify the reference format throughout the paper.
Author Response
We thank you for your valuable feedback. Please see our response below. The updated version of our manuscript is attached.
Comment 1. Section 3.4: It is recommended to add numbering to the formulas in "Kuramoto Slot Attention for Video" and to check for any extraneous comma symbols at the end of formulas throughout the paper.
Response 1. We added numbering to the formulas in Section 3.4 "Kuramoto Slot Attention for Video" and rearranged formulas in Sections 3.1 and 3.2 (pages 5-7, 9, 10).
Comment 2a. Section 3.6.1: The authors replaced the encoder of OCRL with KSAVi, citing that SLATE could not be successfully trained. It is suggested to clarify which specific environments SLATE failed in, whether hyperparameter tuning was attempted ...
Response 2a. We extended Appendix A.4 (OCRL) with information about SLATE, including the hyperparameters we tested and examples of attention maps produced by SLATE (page 19, line 640). We also added Figure A2 (page 21), which provides learning curves comparing OCRL vs. OCRL(SLATE), showing that replacing SLATE with KSAVi does not degrade the performance of OCRL. In Subsection 3.6.1 Baselines (page 12, line 424), we direct readers to the additional information about SLATE in Appendix A.4.
Comment 2b. ... discuss whether the modified OCRL can still be regarded as the original method
Response 2b. The authors of OCRL proposed a general architecture for a PPO agent based on a Transformer encoder to fuse object-centric representations. They integrated it with different object-centric representation models and obtained the best results with the SLATE model. We view the proposed Transformer-based PPO agent as a general framework that can be integrated with different object-centric representation models. Since both SLATE and KSAVi use the Slot Attention module, no architectural changes are needed to integrate KSAVi with the OCRL Transformer backbone. We believe it is still valid to call this baseline OCRL, as its architecture follows the original Transformer-based PPO agent.
Comment 3. Section 3.8 and Table 3: Success rates and returns from different environments are directly normalized and then averaged. Given the different dimensions and distribution characteristics of the two metrics, it is recommended to explain the rationale behind this aggregation approach.
Response 3. We introduce the mean normalized score as a measure of the robustness of an object-centric agent's learning. It should be high for agents that demonstrate consistently good performance across a diverse set of tasks, and it penalizes agents that perform very well on some tasks but completely fail on others. This idea is inspired by the mean human-normalized score [1] used for the Atari benchmark, which provides an average score across 57 games whose raw scores range from tens to hundreds of thousands. We extended Section 3.8 (page 13, line 478) to describe our motivation for introducing the mean normalized score.
[1] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Charles Blundell (2020) Agent57: Outperforming the Atari Human Benchmark
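The aggregation described in Response 3 can be illustrated with a minimal sketch. The response does not spell out the normalization itself, so the code below assumes a reference-range rescaling modeled on the human-normalized score, (agent - random) / (human - random). The function names, the per-task `(low, high)` reference values, and all numbers are hypothetical and are not results from the manuscript.

```python
def normalized_score(raw, low, high):
    """Rescale a raw task score so that `low` maps to 0 and `high` maps to 1."""
    if high == low:
        raise ValueError("reference range must be non-degenerate")
    return (raw - low) / (high - low)


def mean_normalized_score(raw_scores, reference_ranges):
    """Average the per-task normalized scores of one agent.

    raw_scores: {task_name: raw score (a success rate or a return)}
    reference_ranges: {task_name: (low, high)}, hypothetical reference values
    """
    values = [
        normalized_score(raw_scores[task], *reference_ranges[task])
        for task in raw_scores
    ]
    return sum(values) / len(values)


# Hypothetical numbers for illustration only: one task is scored by success
# rate in [0, 1], the other by episodic return in [0, 200].
ranges = {"task_a": (0.0, 1.0), "task_b": (0.0, 200.0)}
consistent_agent = {"task_a": 0.8, "task_b": 160.0}
uneven_agent = {"task_a": 1.0, "task_b": 0.0}

consistent_mean = mean_normalized_score(consistent_agent, ranges)
uneven_mean = mean_normalized_score(uneven_agent, ranges)
print(consistent_mean, uneven_mean)
```

Under these assumptions, the agent that is solid on both tasks scores higher than the agent that is perfect on one task and fails the other, which is the behavior the response attributes to the metric.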
Comment 4. Section 3.8: Exponential smoothing (parameter 0.3) is used in Figure 5. It is suggested to explain the basis for selecting this smoothing parameter, or to also present the raw curves to avoid masking early-stage fluctuations.
Response 4. There was a mistake in the manuscript: we use an exponential smoothing coefficient of 0.7, not 0.3. We selected this coefficient as a trade-off between reducing noise and still prioritizing recent performance. To remove any doubt, we added Figures A3 and A4 (page 23), which show the curves without exponential smoothing applied.
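For readers who want to reproduce the smoothing mentioned in Response 4, the following is a minimal sketch. It assumes the TensorBoard-style convention in which the coefficient (0.7 here) weights the previous smoothed value; the response does not state which convention was used, so this reading is an assumption, and the function name is hypothetical.

```python
def exponential_smoothing(values, weight=0.7):
    """Smooth a curve with s_t = weight * s_(t-1) + (1 - weight) * x_t.

    `weight` is assumed to be the coefficient on the previous smoothed value;
    weight = 0 returns the raw curve unchanged.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must lie in [0, 1)")
    smoothed = []
    previous = None
    for x in values:
        # The first point is kept as-is; later points blend with the history.
        previous = x if previous is None else weight * previous + (1.0 - weight) * x
        smoothed.append(previous)
    return smoothed


raw_curve = [0.0, 1.0, 1.0]  # hypothetical evaluation scores
smoothed_curve = exponential_smoothing(raw_curve, weight=0.7)
unsmoothed_curve = exponential_smoothing(raw_curve, weight=0.0)
print(smoothed_curve)
```

With this convention a larger coefficient suppresses more noise but also delays the curve's reaction to early changes, which is why plotting the unsmoothed curves alongside is a useful check.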
Comment 5. It is suggested to review recent reinforcement learning based control methods, discuss the model based and data based methods, on policy or off policy methods, and if possible it is suggested to compare the proposed control with typical methods.
Response 5. We added Section 2.3 "Reinforcement Learning" (page 3, line 97), where we discuss reinforcement learning-based control methods. We also conducted additional experiments with the recent model-free off-policy algorithm SDAC; details of training SDAC are provided in Appendix A.3 (page 19, line 633). We do not compare with model-free on-policy algorithms, as they are less sample-efficient than off-policy methods, and in this research we primarily focus on the sample efficiency of the considered methods. Among model-based RL algorithms, TD-MPC2 is considered state-of-the-art in continuous control tasks. Another strong baseline, especially for image-based tasks, is DreamerV3. Results for both TD-MPC2 and DreamerV3 are already presented in Table 2 (page 15). We added Figure A5 (page 24) to the Appendix, where we compare KORL with monolithic (non-object-centric) algorithms: SDAC, TD-MPC2, and DreamerV3.
Comment 6. References: The reference formats in the paper are inconsistent (e.g., the way arXiv IDs are presented differs across entries). It is recommended to unify the reference format throughout the paper.
Response 6. We unified the reference formats.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presents a work to address robotic manipulation tasks in an open field. They proposed an approach based on RL. My specific comments/questions are as follows.
Comment 1:
Quoted from the manuscript: “However, studies show that RL performance can benefit from better representations [6,7], as end-to-end training with only standard RL objectives often yields suboptimal policies.”
Comment/question: in the quoted sentence, better representation of what? The sentence sounds not clear. Please also give the support to the above statement, e.g. giving references.
Comment 2:
It is not clear about what challenge in the problem studied in the present paper. Is the challenge about object identification? Planning for grasping?
Comment 3:
The idea in the method developed in this paper seems to “divide the whole into pieces”, specifically, treating a group of objects as a whole with attention to the feature of this whole (i.e., the whole system) into individual objects with attention to the feature of each object. In my view, the idea of object-centric is of course more correct, but I hope to provide more supports that the existing literature is not object-centric. In my view, in the context of robotic manipulation tasks, the tasks must be conducted in a way that each object is individually handled.
Comment 4:
How about robust and resilient with the proposed method. The paper seems to claim that their method is robust, see quoted “Our approach introduces a novel Kuramoto Slot Attention for Video (KSAVi) model that integrates Kuramoto oscillatory neurons with the Slot Attention module to ‘robustly’ extract object representations.” First, there is not any experiment or simulation to test the robustness of the proposed method. Second, I am not sure about what definition of robustness is taken in the present paper; the definition of robustness and resilience in the context of robotics can be found in the literature, see ‘A Novel Resilient Robot: Kinematic Analysis and Experimentation’ (IEEE Access), from which one can see the difference of resilience from robustness. Therefore, the resilience of the proposed method should be a practical concern as there will be some disturbances that may damage the system or stretch the capability of the manipulator beyond its limit.
Comment 5:
Quoted from the manuscript: “Automating routine tasks enhances productivity across industries, from defect-based object sorting to automotive welding. Classical continuous control algorithms can achieve this but require meticulous tuning to handle edge cases; unexpected failures may halt production entirely.”
Comment: Please clarify the concept of continuous control. It is noted that the term ‘continuous’ is often used as opposed to “discrete event”.
Comment 6:
Please also clarify the world model in the present paper.
Comment 7:
Please compare the proposed method for robotic manipulation tasks with the latest method based on the RNN and MPC for mobile robots, see ‘A robust interpolated model predictive control based on recurrent neural networks for a nonholonomic differential-drive mobile robot with quasi-LPV representation: computational complexity and conservatism’.
Author Response
We thank you for your thoughtful review. Please see our response below. The updated version of our manuscript is attached.
Comment 1. Quoted from the manuscript: “However, studies show that RL performance can benefit from better representations [6,7], as end-to-end training with only standard RL objectives often yields suboptimal policies.”
Comment/question: in the quoted sentence, better representation of what? The sentence sounds not clear. Please also give the support to the above statement, e.g. giving references.
Response 1. In this sentence, we mean representations of the environment states. We updated the sentence in the manuscript and added references to methods in which improving the representations improves the performance of RL agents (page 1, line 25).
Comment 2. It is not clear about what challenge in the problem studied in the present paper. Is the challenge about object identification? Planning for grasping?
Response 2. In this paper, we investigate the application of object-centric representations in model-based reinforcement learning and aim to improve the stability of object-centric agent learning across a visually diverse set of tasks. Previous work (OCRL[1], SOLD[2]) has shown that in object-oriented tasks, the sample efficiency of both model-free and model-based RL algorithms using object-centric representations can be higher than that of standard approaches, which encode observations into a single vector. However, SOLD demonstrated this only on a set of visually similar tasks. We show that the performance of object-centric RL algorithms depends on the quality of the object-centric representations. In tasks where SAVi[3] — the representation model used in SOLD — fails to produce disentangled slots, SOLD demonstrates inferior performance. The proposed KSAVi model reliably disentangles objects and foreground across a visually diverse set of tasks. Our proposed RL algorithm, KORL, uses KSAVi and maintains an object-centric world model, which enables planning in a disentangled latent space. KORL demonstrates more stable performance across all tasks, as measured by the mean normalized score.
[1] Jaesik Yoon, Yi-Fu Wu, Heechul Bae, Sungjin Ahn (2023) An Investigation into Pre-Training Object-Centric Representations for Reinforcement Learning
[2] Malte Mosbach, Jan Niklas Ewertz, Angel Villar-Corrales, Sven Behnke (2025) SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels
[3] Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, Thomas Kipf (2022) SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Comment 3. The idea in the method developed in this paper seems to “divide the whole into pieces”, specifically, treating a group of objects as a whole with attention to the feature of this whole (i.e., the whole system) into individual objects with attention to the feature of each object. In my view, the idea of object-centric is of course more correct, but I hope to provide more supports that the existing literature is not object-centric. In my view, in the context of robotic manipulation tasks, the tasks must be conducted in a way that each object is individually handled.
Response 3. Standard RL approaches use a convolutional encoder that encodes the observation into a single vector. While these representations contain information about individual objects in the scene, their highly entangled nature prevents them from capturing the sparsity of real-world scenes. When such representations are used in monolithic model-based algorithms like TD-MPC2[4], the policy attends to a monolithic latent vector, making it difficult to model individual factors of variation in the scene. In contrast, KORL uses disentangled object-centric representations and a GNN-based world model that maintains a separate latent for each object. This architecture allows the policy to primarily attend to slots containing important information — such as manipulated objects and the robotic arm — while ignoring background slots. Our zero-shot generalization experiment (Section 4.1, page 16, line 547) clearly demonstrates this: when we change the background, the success rate of monolithic TD-MPC2 drops to zero, whereas KORL still solves the task. This confirms that the components of KORL handle slots individually, attending mainly to task-relevant objects and ignoring the background slot.
[4] Nicklas Hansen, Hao Su, Xiaolong Wang (2024) TD-MPC2: Scalable, Robust World Models for Continuous Control
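The idea in Response 3 of keeping a separate latent for each object can be sketched as one round of message passing over a fully connected slot graph. This is only an illustration under stated assumptions: `edge_fn` and `node_fn` are hand-written stand-ins for learned networks, sum aggregation is assumed, and the actual KORL world model may differ in all of these details.

```python
def message_passing_step(slots, action, edge_fn, node_fn):
    """Predict next-step slots while keeping one latent vector per slot."""
    next_slots = []
    for i, receiver in enumerate(slots):
        # Sum the messages sent to this slot by every other slot.
        aggregated = [0.0] * len(receiver)
        for j, sender in enumerate(slots):
            if i == j:
                continue
            message = edge_fn(sender, receiver)
            aggregated = [a + m for a, m in zip(aggregated, message)]
        # Each slot is updated from its own latent, its messages, and the action.
        next_slots.append(node_fn(receiver, aggregated, action))
    return next_slots


def toy_edge_fn(sender, receiver):
    """Hypothetical pairwise interaction: pull the receiver toward the sender."""
    return [0.1 * (s - r) for s, r in zip(sender, receiver)]


def toy_node_fn(receiver, aggregated, action):
    """Hypothetical node update: add messages and a scalar action offset."""
    return [r + m + action for r, m in zip(receiver, aggregated)]


slots = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # three slots, two features each
predicted = message_passing_step(slots, 0.0, toy_edge_fn, toy_node_fn)
print(predicted)
```

Because the output has one vector per input slot, a downstream policy can weight task-relevant slots and down-weight a background slot, which is the property the response appeals to.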
Comment 4. How about robust and resilient with the proposed method. The paper seems to claim that their method is robust, see quoted “Our approach introduces a novel Kuramoto Slot Attention for Video (KSAVi) model that integrates Kuramoto oscillatory neurons with the Slot Attention module to ‘robustly’ extract object representations.” First, there is not any experiment or simulation to test the robustness of the proposed method. Second, I am not sure about what definition of robustness is taken in the present paper; the definition of robustness and resilience in the context of robotics can be found in the literature, see ‘A Novel Resilient Robot: Kinematic Analysis and Experimentation’ (IEEE Access), from which one can see the difference of resilience from robustness. Therefore, the resilience of the proposed method should be a practical concern as there will be some disturbances that may damage the system or stretch the capability of the manipulator beyond its limit.
Response 4. In this paper, by "robustness" we refer to the robustness of RL agent learning — that is, the ability to train an agent across diverse tasks using a fixed set of hyperparameters under a limited interaction budget. Current model-based state-of-the-art algorithms such as TD-MPC2[4] and DreamerV3[7] are considered robust in this sense. For object-centric agents, robustness depends on the quality of representations provided by the object-centric model, which in turn depends on the visual complexity of the environment. Therefore, for an object-centric RL agent to be robust, its representation model must itself be robust, in the same sense as used in [6]: an object-centric representation model is robust if it provides stable and consistent results across visually diverse datasets. In this paper, we show that KORL is more robust than object-centric RL baselines, as it demonstrates more stable performance on a diverse set of tasks, measured by the mean normalized score.
Regarding resilience: our method is based on TD-MPC2, which does not impose any resilience constraints on agent behavior. While extensions of TD-MPC2 that incorporate safety constraints exist [5], we consider this direction to be outside the scope of our research, which focuses on investigating the properties and performance of object-centric representations and disentangled latent world models.
[5] Artem Latyshev, Gregory Gorbov, Aleksandr I. Panov (2025) Safe Planning and Policy Optimization via World Model Learning
[6] Jinwoo Kim, Janghyuk Choi, Ho-Jin Choi, Seon Joo Kim (2023) Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning
[7] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap (2024) Mastering Diverse Domains through World Models
Comment 5. Quoted from the manuscript: “Automating routine tasks enhances productivity across industries, from defect-based object sorting to automotive welding. Classical continuous control algorithms can achieve this but require meticulous tuning to handle edge cases; unexpected failures may halt production entirely.”
Comment: Please clarify the concept of continuous control. It is noted that the term ‘continuous’ is often used as opposed to “discrete event”.
Response 5. By "continuous control," we refer to control in tasks with continuous action spaces, as opposed to discrete action spaces. Some RL algorithms are specifically designed for continuous action spaces (e.g., TD-MPC2), while others are designed only for discrete action spaces (e.g., DQN[8]). By "classical control algorithms," we refer to traditional mathematical methods used to regulate dynamic systems, such as PID controllers.
[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller (2013) Playing Atari with Deep Reinforcement Learning
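To make the notion of a classical controller in Response 5 concrete, here is a minimal sketch of a PID controller that emits a real-valued command, that is, an action from a continuous action space. The one-dimensional plant, the gains, and the time step are hypothetical choices for illustration and do not come from the manuscript.

```python
class PIDController:
    """Textbook discrete-time PID controller producing a continuous command."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        # No derivative term on the first call, when no previous error exists.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy plant (an assumption): the position changes at the commanded velocity.
dt = 0.05
controller = PIDController(kp=2.0, ki=0.5, kd=0.0, dt=dt)
position = 0.0
for _ in range(400):
    command = controller.step(1.0, position)  # real-valued, not a discrete choice
    position += command * dt
print(position)
```

The gains here are tuned by hand for this single plant, which mirrors the manual tuning burden the manuscript attributes to classical controllers; a DQN-style agent would instead pick from a finite set of actions.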
Comment 6. Please also clarify the world model in the present paper.
Response 6. In our research, we use the standard practical RL definition of a world model as a component of the RL agent that consists of learned approximations of the environment's transition and reward functions. We have extended the discussion of world models in Section 3.1 (page 5, line 208; page 6, line 223).
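The definition in Response 6 can be sketched as a small interface: a world model bundles an approximate transition function and an approximate reward function, and planning uses them to evaluate action sequences without touching the real environment. The class name, the scalar state, and the toy lambdas standing in for learned networks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = float   # stand-in for a latent state
Action = float  # stand-in for a continuous action


@dataclass
class WorldModel:
    """Approximate transition and reward functions used for imagination."""

    transition: Callable[[State, Action], State]
    reward: Callable[[State, Action], float]

    def imagine_return(self, state: State, actions: Sequence[Action],
                       discount: float = 0.99) -> float:
        """Roll the model forward and sum the discounted predicted rewards."""
        total, scale = 0.0, 1.0
        for action in actions:
            total += scale * self.reward(state, action)
            state = self.transition(state, action)
            scale *= discount
        return total


# Toy stand-ins for learned functions: move along a line toward position 1.0.
toy_model = WorldModel(
    transition=lambda s, a: s + a,
    reward=lambda s, a: -abs(s + a - 1.0),
)
imagined = toy_model.imagine_return(0.0, [0.5, 0.5, 0.0])
print(imagined)
```

In a learned system both callables would be neural networks trained from interaction data; the rollout logic stays the same.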
Comment 7. Please compare the proposed method for robotic manipulation tasks with the latest method based on the RNN and MPC for mobile robots, see ‘A robust interpolated model predictive control based on recurrent neural networks for a nonholonomic differential-drive mobile robot with quasi-LPV representation: computational complexity and conservatism’.
Response 7. KORL learns representations and dynamics from raw images, whereas in the referenced paper [9] the pipeline starts from known robot equations, which are converted into a tractable LPV form. The authors design a controller with provable stability and robustness guarantees, using an RNN to solve the resulting QP optimization problem. While both approaches employ MPC-style planning (ours using MPPI and theirs using classical MPC), the remaining components and application domains are quite distinct. Most fundamentally, we use reinforcement learning as the training paradigm and learn directly from pixel observations, whereas the authors of [9] rely on known system equations and do not learn from interactions. Due to these substantial differences in input modality and methodology, a direct experimental comparison would not be meaningful.
[9] Mohsen Hadian, W. J. Zhang, Danial Etesami (2024) A robust interpolated model predictive control based on recurrent neural networks for a nonholonomic differential-drive mobile robot with quasi-LPV representation: computational complexity and conservatism
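The MPPI-style planning mentioned in Response 7 can be sketched in a few lines. This is a simplified illustration, not the planner used in KORL or TD-MPC2: it assumes a scalar state, fixed sampling noise, no policy prior, and no value bootstrap, and the `dynamics` and `reward` callables stand in for a learned world model. All parameter names and defaults are hypothetical.

```python
import math
import random


def mppi_plan(state, dynamics, reward, horizon=5, num_samples=256,
              iterations=4, temperature=0.5, noise_std=0.5, seed=0):
    """Return the first action of an exponentially weighted action sequence."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    for _ in range(iterations):
        candidates, returns = [], []
        for _ in range(num_samples):
            # Perturb the current mean plan and clip actions to [-1, 1].
            actions = [max(-1.0, min(1.0, m + noise_std * rng.gauss(0.0, 1.0)))
                       for m in mean]
            s, total = state, 0.0
            for a in actions:
                total += reward(s, a)
                s = dynamics(s, a)
            candidates.append(actions)
            returns.append(total)
        # Softmax weights over returns; subtracting the best return keeps exp stable.
        best = max(returns)
        weights = [math.exp((r - best) / temperature) for r in returns]
        norm = sum(weights)
        mean = [sum(w * c[t] for w, c in zip(weights, candidates)) / norm
                for t in range(horizon)]
    return mean[0]


# Toy model (an assumption): each action moves the state by half its value,
# and the reward penalizes the squared distance of the next state from 1.0.
first_action = mppi_plan(
    state=0.0,
    dynamics=lambda s, a: s + 0.5 * a,
    reward=lambda s, a: -((s + 0.5 * a - 1.0) ** 2),
)
print(first_action)
```

Only the first action of the plan is executed before replanning, which is the receding-horizon pattern shared with classical MPC; the difference highlighted in the response lies in where the model comes from and how the optimization is solved.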
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I am still concerned about the authors’ response to my comment 4. I think that the authors need to write some of their response into the paper, e.g., the definition of robustness they adopt in this paper, rather than refer to the TD-MPC2 framework on which their work is based. Also, future work regarding resilience may be discussed, even though resilience is outside the scope of the present paper. Resilience is not equivalent to safety. Overall, the literature on the general definitions of robustness and resilience may be commented on, followed by the specific definition for their problem. Such an approach can improve the generality of their work. Currently, the paper looks quite narrow in implication, which is emphasised by this journal, ‘Technologies’.
The same concern covers the authors’ response to my comment 7, some of which may be incorporated into the manuscript to improve the generality of their contribution and implication.
Author Response
Thank you for your clarifying comment. Please find our response below. Regarding Comment 4, we expanded Section 3.8 (page 14, line 498) to include a discussion of robustness definitions in robotics, deep learning, and deep reinforcement learning, as well as to clarify the definition of robustness adopted in our work. We also added a discussion of possible future work on how the resilience of the proposed object-centric RL algorithm, KORL, can be enhanced (page 18, line 639). To address your concern regarding Comment 7, we expanded Section 3.1 (page 6, line 255) to discuss the application of MPC and to highlight the differences between our learning-based approach and classical control-theoretic methods.
