Article

Autonomous Grasping of Deformable Objects with Deep Reinforcement Learning: A Study on Spaghetti Manipulation

by Prem Gamolped 1, Nattapat Koomklang 1, Abbe Mowshowitz 2 and Eiji Hayashi 1,*
1 Department of Creative Informatics, Kyushu Institute of Technology, Fukuoka 820-8502, Japan
2 Department of Computer Science, The City College of New York, 160 Convent Avenue, New York, NY 10031, USA
* Author to whom correspondence should be addressed.
Robotics 2025, 14(8), 113; https://doi.org/10.3390/robotics14080113
Submission received: 22 April 2025 / Revised: 23 July 2025 / Accepted: 6 August 2025 / Published: 18 August 2025
(This article belongs to the Section Humanoid and Human Robotics)

Abstract

Packing food into lunch boxes requires selecting the correct portion. Food items such as fried chicken, eggs, and sausages are straightforward to manipulate when packing. In contrast, deformable objects like spaghetti pose challenges for lunch box packing because of their fragility, their tendency to break apart, and the fluctuating weight of the noodles. Furthermore, achieving the correct amount is crucial for lunch box packing. This research focuses on self-learned grasping by a robotic arm, enabling it to autonomously predict grasps for deformable objects, specifically spaghetti, and to pick up amounts within specified weight ranges. We use deep reinforcement learning as the core learning method. We developed a custom environment and policy network for a simplified real-world scenario modeled on a food factory, incorporating multiple sensors to observe the environment and a pipeline that works with a real robotic arm. Through the study and experiments, our results show that the robot can grasp spaghetti within the desired ranges, although occasional failures were caused by the nature of the deformable object. Addressing varying environmental conditions, for example through data augmentation, can partially improve model prediction. The study highlights the potential of combining deep learning with robotic manipulation for complex deformable object tasks, offering insights for applications in automated food handling and other industries.

1. Introduction

Automation in industry has become a key driver of efficiency because of its role in repetitive tasks such as welding, painting, assembly, packaging, and more [1]. Moreover, its integration with robotics technology provides many advantages [2], chiefly in work speed, and with a suitable device setup, a robot can be reprogrammed for other tasks. Many applications in industry can be transformed into robotic-based systems. A notable example is the use of bin-picking robots with 3D object recognition, in which robots autonomously recognize the parts required for specific tasks and pick them for human workers [3]. This allows humans to focus on more complex and responsible tasks, thus optimizing overall productivity. In Japan, both large and small lunch box preparation companies rely mainly on human workers to assemble food into lunch boxes. On the production line [4], lunch boxes move slowly along a conveyor while human workers manually pick and place prepared foods into them, as shown in Figure 1. This process shares similarities with conventional bin-picking robotic systems; however, in this case, the target objects are food items. The selected items include general types such as fried chicken, meatballs, and sausages [5], as well as small-piece items like dried radish, hijiki seaweed, and deep-fried tofu [6].
Robotic manipulation of deformable food items presents unique challenges due to the fragile, flexible, and non-uniform nature of these objects. In this study, we focus on the task of robotic spaghetti grasping for lunch box preparation, a task that requires both delicacy and precision. Spaghetti strands are easily damaged when mishandled, and grasping the correct amount adds further complexity because of their loosely structured and entangled nature. These characteristics make spaghetti a particularly difficult target for robotic automation in food packaging. While prior research in robotic food handling has shown success in manipulating more robust and rigid items such as fried chicken, broccoli, meatballs, and sausages, deformable foods like spaghetti remain largely underexplored. A few studies have attempted to tackle similarly delicate items such as seaweed or noodle-like foods, but consistent, portion-controlled grasping of spaghetti remains a significant challenge, especially in real-world settings requiring repeatable, gentle handling.
This study aims to address that gap by developing an autonomous grasping system that enables a robotic arm to learn how to pick up the appropriate amount of spaghetti within a defined weight range. The main objective is to investigate whether deep reinforcement learning (DRL) can be used to learn this manipulation task in a self-supervised manner, without relying on pre-programmed grasping. To achieve this, we design a complete robotic system that includes a robotic arm, an RGB-D camera for visual perception, an electronic scale for real-time weight feedback, and a custom-designed flexible gripper tailored for handling delicate foods. The DRL framework is used to train the robot in a custom environment that reflects real-world variability. The training process includes camera-based object recognition, real-time feedback from the scale, and a reward function that encourages accurate, damage-free, weight-controlled grasping. The system workflow covers training with configurable settings and evaluation of the trained model. To enhance robustness across different environments, we apply data augmentation techniques to improve the generalization of the learned policy. Additionally, we use a custom feature extractor based on a ResNet-50 architecture to process visual inputs effectively.
In summary, this study applies DRL to a complex, real-world task involving deformable object manipulation under strict quantitative constraints. The system is validated through a series of physical experiments, showing the potential for deploying self-learning robotic systems in automated food packaging environments. The article is organized as follows: Section 2 provides an analysis of studies in the field, as well as insights from our own preliminary research. In Section 3, we outline the methodology, which covers the noodle-like food grasping problem, the problem formulation, and the system overview. Section 4 describes the framework developed to perform noodle-like food grasping and explains how DRL is applied to real-world problem scenarios. Section 5 discusses the experimental setup and presents the results of our study. Finally, Section 6 and Section 7 conclude the article and offer perspectives on future research directions.

2. Literature Review

This section reviews relevant studies on autonomous robotic systems and AI applications related to our research. We also include insights from our preliminary findings. We will discuss automation and robotics in food handling, deep reinforcement learning for robotic manipulation, robotic handling of deformable and delicate objects, multi-sensor integration within robotic systems, and the applications and limitations of robotics in food automation.

2.1. Automation and Robotics in Food Handling

Automation and robotics are revolutionizing food handling by improving efficiency, precision, and hygiene. Traditional automation systems are limited in their ability to handle diverse food items. However, recent advances have allowed robotic systems to perform complex tasks—such as sorting, packing, and manipulating food—in dynamic and unstructured environments, as illustrated in Figure 2. Advanced sensors such as RGB-D cameras and force sensors allow robots to perceive and interact with food effectively. Vision-based systems [7,8] are widely used for fruit sorting and grading based on attributes such as color and size [9]. Soft robotics has also gained traction for handling delicate items, such as chopped green onions or fried shrimp on a tray, without damage [10]. Although such automated systems are highly effective and can handle large numbers of products with precision, they excel mainly at predefined tasks and have limited adaptability: they require reprogramming whenever their functionality must be updated or modified for new tasks, which can be time consuming and requires specialized expertise.

2.2. Deep Reinforcement Learning for Robotic Manipulation

Deep reinforcement learning (DRL) has emerged in recent years as a powerful machine learning approach for learning optimal actions in complex tasks so as to maximize the return $G_t$. Some classic Temporal Difference (TD) [12] reinforcement learning (RL) methods, such as the Q-learning algorithm [13], are well known for solving simple tasks such as grid-world navigation. In this task, the environment consists of a 2D grid in which an agent must navigate to a goal while avoiding obstacles. The problem is formulated as an MDP, as shown in Figure 3. The agent operates in a finite set of states $S$ (the grid cells); in each state $s \in S$, it selects an action $a \in A$, where $A = \{\text{up, down, left, right}\}$, based on the state $s$ and transitions to a new state $s'$. After taking action $a$, the information is stored in the Q-table, which holds the value of taking each action in a given state—the so-called Q-value $Q(s, a)$. The following equation is then used to update the Q-value:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \quad (1)$$
where $\alpha$ is the learning rate, which determines how much the value is updated; $r$ is the immediate reward for taking action $a$ in state $s$; $\gamma$, the discount factor, represents the importance of future rewards; and $\max_{a'} Q(s', a')$ is the maximum Q-value for the next state $s'$, representing the best possible return achievable from that state.
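To make the update rule concrete, the following minimal Python sketch applies this tabular rule to a hypothetical 5 × 5 grid world; the grid size, reward values, and ε-greedy settings are illustrative assumptions rather than details taken from the cited examples.

```python
import numpy as np

# Hypothetical 5x5 grid world: states are cell indices 0..24, goal at cell 24.
N_STATES, N_ACTIONS = 25, 4          # actions: 0=up, 1=down, 2=left, 3=right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))  # Q-table holding Q(s, a) values

def step(s, a):
    """Illustrative transition: move on the grid, reward 1 at the goal, 0 otherwise."""
    row, col = divmod(s, 5)
    if a == 0:   row = max(row - 1, 0)
    elif a == 1: row = min(row + 1, 4)
    elif a == 2: col = max(col - 1, 0)
    else:        col = min(col + 1, 4)
    s_next = row * 5 + col
    reward = 1.0 if s_next == 24 else 0.0
    return s_next, reward, s_next == 24

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = np.random.randint(N_ACTIONS) if np.random.rand() < EPSILON else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```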
Classic RL algorithms are well suited to simple tasks, as shown in Figure 4, but they become difficult to apply in robotics, where problems often involve high-dimensional observations (e.g., images or point cloud data). Handling such complexity requires efficient function approximation. This has led to algorithms that integrate classic RL with a Deep Neural Network (DNN), enabling the agent to solve complex problems effectively. Chen Y.L. et al. [7] integrate the You Only Look Once (YOLO) algorithm for object detection with the Soft Actor-Critic (SAC) algorithm for robotic grasping, enabling a 6-Degree-of-Freedom (DoF) manipulator to learn grasping tasks on its own. The combination of YOLO for rapid object recognition and SAC for policy optimization ensures efficient grasping in real-world applications, including unseen objects. In addition, that study uses sim-to-real transfer to minimize the cost and risk of real-world training. Mohammed M.Q. et al. [14] propose a pick-and-place robotic manipulator system in a simulated scenario, designed to grasp target objects effectively in cluttered environments. The system uses RGB-D images and point cloud data to generate 36 distinct angles of heightmaps as inputs to the proposed grasping policy network, which consists of DenseNet-121 with an FCN. A Deep Q-learning algorithm is employed to determine the optimal grasping position, enhancing the system's ability to handle complex and dynamic manipulation tasks.

2.3. Robotic Handling of Deformable and Delicate Objects

Robotic handling of deformable and delicate objects has become a significant challenge in automation due to their susceptibility to damage, their variability in shape, and the need for a firm yet gentle grip. Low J.H. et al. [18] propose a reconfigurable gripper capable of adjusting its fingers to various grip poses using the Grip Pose Comparator (GPC) framework, enabling the grasping of a wide range of food items such as broccoli, potatoes, sausages, and tomatoes. Wang Z. et al. [10] propose a pneumatically driven soft needle gripper designed for food handling, featuring dual capabilities of grasping and piercing. This design enables the manipulation of shredded and chopped food materials such as shredded cabbage and chopped green onions. Franco L. et al. [19] propose a novel tendon-driven soft-rigid double-scoop gripper capable of operating in constrained or narrow spaces, such as within food tray compartments. This design enables the grasping of various food items—such as meatballs, cookies, carrots, and sausages. The soft-rigid tendon-driven mechanism allows the fingers to flexibly conform to and place items into containers of varying dimensions. Wang Z. et al. [20] introduce a scooping-binding robotic gripper designed to handle a wide variety of food shapes and sizes, such as cheese, green peppers, eggs, and tomatoes. By incorporating a flexible string mechanism, the gripper is capable of securely grasping soft and delicate materials without causing damage. In our laboratory, we conduct research on the robotic assembly of food items into lunch boxes using a robotic arm. As illustrated in Figure 5, we have developed and evaluated various types of grippers tailored for handling Japanese foods, including onigiri (rice balls), fried chicken, and sliced ham, as summarized in Table 1, which presents a comparison of the different gripper designs.

2.4. Multi-Sensor Integration in Robotic Systems

Multi-sensor integration plays a crucial role in modern robotic systems by enabling perception-based feedback, which is essential for decision making during task execution. The selection of applicable and robust sensors is critical to system performance. By combining data from multiple sensors such as RGB-D cameras, force-torque sensors, and Inertial Measurement Units (IMUs), the robot's ability to perceive and interact with its environment is significantly enhanced, enabling it to handle more complex tasks with better accuracy and reliability. A widely adopted technique for sensor fusion is the Kalman Filter (KF) [22], which provides an optimal estimate of the system state by integrating noisy measurements from multiple sensors. Ahmed M. et al. [23] propose a real-time system for tracking the position of the human wrist and use the KF to filter out measurement variance, enhancing the accuracy and reliability of localization data. An advanced approach [7,8,23,24] to multi-sensor integration involves combining sensor data with deep learning techniques, such as object recognition and localization, to enable autonomous pick-and-place operations. This integration enhances the robot's capabilities, allowing it to independently recognize and grasp objects without explicit human commands. Moreover, by leveraging multi-modal sensor data, the system can adapt to varying environmental conditions and object types, improving robustness. This approach not only reduces the need for manual intervention but also accelerates task execution in dynamic and unstructured environments, making it suitable for industrial robotics applications.
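As a simple illustration of the filtering principle, the sketch below implements a one-dimensional Kalman filter with a constant-position process model; the noise variances and the 0.3 m example signal are arbitrary values chosen for the example and are unrelated to the cited wrist-tracking system.

```python
import numpy as np

def kalman_1d(measurements, q=1e-4, r=0.05**2):
    """Minimal 1D Kalman filter with a constant-position process model.
    q: process noise variance, r: measurement noise variance (illustrative values)."""
    x, p = 0.0, 1.0                  # initial state estimate and estimate variance
    estimates = []
    for z in measurements:
        # predict: state unchanged, uncertainty grows by the process noise
        p = p + q
        # update: blend prediction and measurement using the Kalman gain
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Example: noisy readings of a position hovering around 0.3 m
noisy = 0.3 + 0.05 * np.random.randn(100)
smoothed = kalman_1d(noisy)
```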

2.5. Applications and Limitations of Robotics in Food Automation

The food industry faces challenges such as high production demands and the significant labor required to operate food manufacturing facilities [25]. Robotics technology offers effective solutions to these issues and is increasingly used in the industry. Robots are well suited to performing repetitive tasks. Candemir A. et al. [26] developed an autonomous pick-and-place robotic system capable of recognizing food objects in industrial settings. Tulapornpipat W. et al. [21] proposed a multi-tool gripper for robotic applications in the food industry, integrating a pneumatically actuated soft gripper and a vacuum pad. Their system can recognize various types of Japanese food and perform pick-and-place operations effectively. These robotic arms often operate alongside conveyor systems on the production line. Among the most commonly used types is the articulated robotic arm, designed to mimic the movements of a human arm. This type of robotic arm is capable of performing complex motions, which makes it highly versatile for tasks such as sorting, pick and place, and assembly. Figure 6 illustrates the sequence of a vision-based food grasping pick-and-place task that perceives the environment with RGB-D image data and processes it through food detection and segmentation, as depicted in Figure 7. While robotics and AI technology offer significant advantages in food automation, several limitations remain. Handling non-rigid and irregularly shaped food items [27] is challenging and requires advanced grippers and control algorithms; thus, most systems opt for fixed food items on the production line. High costs and the complexity of integrating robotic systems into existing operations [28] are additional barriers. Furthermore, the lack of standardized solutions for diverse food types and production environments limits scalability. The maintenance and expertise required to operate and program robots further add to the operational challenges. These limitations highlight the need for ongoing research and development to enhance the adaptability and affordability of robotic systems in the food industry.

3. Methodology

3.1. Noodle-like Grasping Problem

The problem framework of this study is based on a food production facility in Japan where workers are tasked with assembling foods into lunch boxes that move continuously on a conveyor along the production line, as illustrated in Figure 1. Food items with simple shapes, such as fried chicken, sausages, or grilled pork, can be transferred from the preparation tray to the lunch box in a straightforward manner. However, smaller items (e.g., shredded onion or chopped beef, as shown in Figure 1b) and noodle-like foods must be portioned in precise amounts during packaging. The nature of this task is inherently repetitive, as employees are required to continuously pick and measure the weight of spaghetti. If the measured weight is within the target tolerance, the spaghetti is placed into the desired compartment of the lunch box, as illustrated in the decision-making flowchart for spaghetti and other noodle-like food items in Figure 8b. Human hands are dexterous, and this dexterity allows workers to observe the spaghetti in the tray and execute complex grasping behaviors. These include untangling the noodles before grasping, followed by pinching and shaking to separate sticking strands, ensuring an optimal grasp. However, the process often needs to be repeated several times to achieve the correct amount. Furthermore, the varying states of the spaghetti can influence the grasping strategy, requiring adjustments such as deeper or wider grasps to ensure the correct amount and effective handling.
The overview of the spaghetti grasping task in robotics involves mimicking human grasping behavior, as illustrated in Figure 8, to develop a self-learning system. At each time step t, the system obtains the state s t of the spaghetti by observing the top view of the tray, supplemented with additional information such as a digital scale and a pretrained detection model for identifying and localizing the spaghetti. DRL is employed to generate the action a t for grasping. Following the grasp, the system measures the grasped weight using the digital scale to evaluate the success of that action a t . The details of each component will be discussed in the following sections.

3.2. Problem Formulation

As explained in Section 3.1, the self-learning system acquires knowledge through interaction with the environment, using trial and error to approach the optimal policy for achieving the target weight. Therefore, a Markov decision process (MDP) [12] was selected to formulate the problem in this study. The agent refers to the robot arm, which aims to learn the optimal grasping strategy; the environment represents the spaghetti in the tray with which the agent interacts during the learning process, as shown in Figure 9. The agent interacts with the environment over a sequence of discrete time steps $t = 0, 1, 2, 3, \ldots, T$. At each time step $t$, the agent perceives an observation from the environment and obtains the current state $s_t$. Based on this state, the agent selects an action $a_t$ and applies it to the environment. The environment then transitions to a new state $s_{t+1}$ and provides a reward $r_{t+1}$, which quantifies the outcome of the action. In the MDP framework, such transitions are generally characterized by a probability function $P(s_{t+1} \mid s_t, a_t)$; however, in this study's model-free approach, the transition probability is unknown and not explicitly estimated. Hence, the agent instead learns directly from observed interactions. This process is repeated until the end of the episode ($t = T$). In Equation (2), $P$ is unknown in the model-free setting.
$$s_{t+1} \sim P(s_{t+1} \mid s_t, a_t). \quad (2)$$
The objective of a typical RL algorithm, stated in Equation (3) as $J(\pi)$, is to learn a policy $\pi$ that maximizes the cumulative reward the agent receives over time. The discount factor $\gamma$ plays a crucial role in weighting future rewards over the total learning sequence $T$. Typically, the discount factor is defined within the range $\gamma \in [0, 1]$, where its value influences the trade-off between long-term and short-term rewards.
$$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{T} \gamma^{t} \, r_{t+1} \right] \quad (3)$$
The MDP is defined as the tuple $(S, A, R, P, \gamma)$, where $S$ represents the finite set of states, $A$ denotes the finite set of actions, $R$ is the reward function, $P$ denotes the state transition probability, and $\gamma$ is the discount factor. Learning in DRL is fundamentally based on trial and error, in which the agent interacts with the environment and collects experience from multiple trajectories $\tau$. The objective is to optimize a policy $\pi$, parameterized by the neural network parameters $\theta$ (denoted $\pi_\theta$), that maximizes the expected cumulative reward. To formulate the spaghetti grasping task as such a problem, the robot needs to observe the environment and use the algorithm to generate an action given the state.
$$(P_t, G_t) = f(s_t), \quad (4)$$
Therefore, the mathematical model is formulated as shown in Equation (4), where the function $f$ represents the policy that maps the observation space to the action space at each time step $t$. The obtained state consists of image data and scale weight data, while the action (grasping pose $P_t$ and gripper width $G_t$) is based on the learned policy. Given that the observation space is high-dimensional and the actions require continuous values—such as a pose within a defined range or a gripper width within a given interval—rather than discrete actions, the Soft Actor-Critic (SAC) [29] algorithm was chosen for this problem. Regarding Equation (4), $s_t$ is structured as a dictionary, $s_t = \{I_{rgb}, I_{depth}, I_{mask}, w_{grasp}, w_{tray}\}$, where each element is as follows:
  • $I_{rgb} \in \mathbb{R}^{H \times W \times 3}$ is the RGB image, providing visual features including lighting conditions;
  • $I_{depth} \in \mathbb{R}^{H \times W}$ is the depth image, representing spatial geometry;
  • $I_{mask} \in \mathbb{R}^{H \times W}$ is the segmentation mask of the target object;
  • $w_{grasp}$ is the weight of spaghetti grasped by the selected action $a_t$;
  • $w_{tray}$ is the total weight of spaghetti in the tray, excluding the tray weight.
After observing the environment and obtaining the state $s_t$, each element undergoes processing and normalization before being passed to the custom policy network (discussed in detail in Section 4). The processed data is fed into the pretrained ResNet-50 to extract features for action selection. The agent selects an action $a_t$ from the action space $A$, where $a_t = \{p_x, p_y, d_{insert}, d_{gripper}\}$. In the action space, $p_x$ and $p_y$ represent the pixel coordinates specifying the grasp point in the image. The variable $d_{insert}$ denotes the insertion depth of the gripper from the spaghetti surface, while $d_{gripper}$ indicates the gripper width prior to grasping the spaghetti. Following action selection by the policy $\pi(a \mid s)$ given the state, each element in $A$ is mapped to the corresponding physical range of the robotic system to ensure proper execution. During action execution, the system measures the weight before grasping, $w_i$, and during grasping, $w_f$. The grasped weight $w_{grasp}$ is computed as the absolute difference between $w_f$ and $w_i$ to ensure a non-negative output, as shown in Equation (5).
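To make the structure of $s_t$ and $a_t$ concrete, the sketch below defines corresponding Gym spaces; the image resolution, weight bounds, and dictionary keys are illustrative assumptions, not the exact values of the implementation.

```python
import numpy as np
from gym import spaces  # Gym 0.21.0 API, as used in this study

H, W = 224, 224  # assumed image resolution after preprocessing

# Dictionary observation space mirroring s_t = {I_rgb, I_depth, I_mask, w_grasp, w_tray}
observation_space = spaces.Dict({
    "rgb":     spaces.Box(low=0, high=255, shape=(H, W, 3), dtype=np.uint8),
    "depth":   spaces.Box(low=0.0, high=np.inf, shape=(H, W), dtype=np.float32),
    "mask":    spaces.Box(low=0, high=1, shape=(H, W), dtype=np.uint8),
    "w_grasp": spaces.Box(low=0.0, high=200.0, shape=(1,), dtype=np.float32),   # grams
    "w_tray":  spaces.Box(low=0.0, high=1000.0, shape=(1,), dtype=np.float32),  # grams, assumed bound
})

# Continuous action a_t = (p_x, p_y, d_insert, d_gripper), squashed to [-1, 1] by tanh
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
```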
$$w_{grasp} = \left| w_i - w_f \right| \quad (5)$$
After executing the selected action, $w_{grasp}$ is used for reward computation. The grasped weight is constrained to the interval $w_{grasp} \in [0, 200]$ g, as defined in Equation (6), in accordance with the specification of the reward function implemented in this study.
$$w_{grasp} = \begin{cases} 200, & w_{grasp} > 200, \\ w_{grasp}, & 0 \leq w_{grasp} \leq 200, \\ 0, & w_{grasp} < 0 \end{cases} \quad (6)$$
In this study, the reward is derived from a Gaussian probability density function (PDF) so that the maximum reward is centered on the desired target grasping weight $\mu$, with an interval that accounts for the allowable error margin $\sigma$. The reward values are interpolated to the range $[0, 1]$ to leverage the benefits of reward scaling, as discussed in [29]. Subsequently, the PDF is multiplied by a constant $c$ to further adjust the reward scaling during training.
$$r(w_{grasp}) = c \cdot \exp\left( -\frac{(w_{grasp} - \mu)^{2}}{2\sigma^{2}} \right) \quad (7)$$
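A minimal sketch of this reward computation is shown below, combining the clipping of Equation (6) with the Gaussian shaping of Equation (7); the target $\mu = 50$ g matches the experiments, while the values of $\sigma$ and $c$ are illustrative.

```python
import numpy as np

def grasp_reward(w_grasp, mu=50.0, sigma=5.0, c=1.0):
    """Gaussian-shaped reward centered on the target weight mu (grams).
    sigma (allowable error margin) and c (reward scale) are illustrative values."""
    # Clip the grasped weight to [0, 200] g as in Equation (6)
    w = float(np.clip(w_grasp, 0.0, 200.0))
    # Gaussian-shaped reward of Equation (7), peaking at w == mu
    return c * np.exp(-((w - mu) ** 2) / (2.0 * sigma ** 2))

# Example: a 47 g grasp yields a reward close to the maximum, a 90 g grasp close to zero
print(grasp_reward(47.0), grasp_reward(90.0))
```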
Given the described MDP formulation, the robot learns an optimal grasping policy by maximizing the expected cumulative reward augmented with entropy regularization, which improves the exploration behavior of the agent. Thus, at each time step $t$, the agent observes the state $s_t$ and selects an action $a_t$ from the stochastic policy $\pi(a_t \mid s_t)$ parameterized by $\theta$. The environment transitions to state $s_{t+1}$ and returns a reward $R(s_t, a_t)$ computed as explained for Equation (7). SAC optimizes a stochastic policy by maximizing both the expected reward and the entropy term $\mathcal{H}(\pi)$, which encourages exploration:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ R(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right], \quad (8)$$
where $\alpha$ is the temperature parameter that balances reward maximization and exploration. SAC optimizes both the actor (policy) and the critic (Q-function) while maintaining an entropy term to encourage exploration. The critic is responsible for estimating the soft Q-value function $Q_\theta(s, a)$, which is used to evaluate the expected return of taking action $a$ in state $s$. Generally, two separate Q-functions, $Q_{\theta_1}$ and $Q_{\theta_2}$, are used to mitigate overestimation bias. The critic is trained by minimizing the Mean Squared Error (MSE) loss. The actor (policy) is responsible for selecting actions using a stochastic Gaussian policy and is optimized to maximize the soft Q-value while encouraging exploration through entropy regularization.

3.3. System Overview

This research presents a robotic system designed for the autonomous grasping of noodle-like objects, specifically spaghetti using DRL. The system integrates an articulated robot arm with a gripper designed to grasp delicate noodle-like food. To perceive the environment, an RGB-D camera is utilized to identify and localize the food item.

3.3.1. Hardware System

In this study, we utilize a Motoman SIA5F 7-DoF articulated arm (Yaskawa Electric Corporation, Kitakyushu, Fukuoka, Japan), capable of handling payloads of up to 5 kg, to ensure the flexible range of motion necessary for precise grasping. The end effector is highly adaptable to various tools, as depicted in Figure 5. The high number of DoFs enables greater flexibility in motion control, making the arm well suited to picking food from a tray and placing it into a lunch box on a moving conveyor. We utilize an RGB-D sensor (Azure Kinect Developer Kit, Microsoft Corporation, Redmond, WA, USA) to obtain the environmental state for the DRL agent and to detect and segment the spaghetti. This camera provides high-quality depth data, and its RGB module features automatic exposure adjustment based on ambient lighting. The fingers were specifically designed for grasping noodle-like food items. A NEMA17 stepper motor (17E13S0404AF2, StepperOnline, Nanjing, China) was integrated to enable precise position control, and the fingers were fabricated using a 3D printer (X-MAX, QIDI Tech, Wenzhou, China), allowing future modifications in size and shape to accommodate different food items. To grasp noodle-like food items, specifically spaghetti, the gripper mechanism was designed with a single contact point, as illustrated in Figure 10. Additionally, the tip of each finger features a slight bend, allowing it to insert into the spaghetti pile and securely hold the strands during grasping, as shown in Figure 11. The fingers were 3D-printed using a soft material to ensure that the gripper does not damage the spaghetti during grasping. The grasping weight is measured using the digital scale, as explained for Equation (5). Figure 12 illustrates the hardware connections within the system. The system is designed to be highly adaptable to various external tools and connections. A LAN switch is used to facilitate communication between devices, enabling multiple PCs and the robot's controllers to operate within the same network. Table 2 below shows the PC specifications used in this study.

3.3.2. Software System

The software system is built on the Robot Operating System (ROS) platform, providing a modular and scalable framework for integrating perception, control, and reinforcement learning. The system runs on Ubuntu 20.04 LTS with ROS Noetic, which enables efficient software management. We utilize the MoveIt motion planning framework (version 1.1.14), which facilitates motion planning and collision checking to ensure obstacle avoidance in the environment. To effectively manage software across multiple devices, we organize the system into multiple ROS workspaces (e.g., robot workspace, camera workspace, detection workspace) and integrate each within the implementation code. This approach enhances modularity and simplifies software development and maintenance.
Prior to this study, we developed a food detection and segmentation model based on Hybrid Task Cascade (HTC) [30] with a ResNet-50 backbone, trained on a food dataset we created, as illustrated in Figure 7. We leverage this model to identify and localize spaghetti. The detection model also outputs a mask image of the detected spaghetti, which is then used as input for the DRL model during training, as illustrated in Figure 13. Our system employs a virtual environment (Miniconda3) to prevent software version conflicts between code workspaces. We use the Stable-Baselines3 DRL framework (version 1.8.0), which offers various deep reinforcement learning algorithms for training; the Soft Actor-Critic (SAC) algorithm was selected, and we customized the policy for our spaghetti grasping environment. Additionally, we use OpenAI Gym (version 0.21.0), which provides a diverse collection of reference environments and a standardized API for developing customized environments. The execution pipeline begins with the perception module, where the RGB-D camera captures raw visual data. The detection model processes the images, generating bounding boxes and segmentation masks of the spaghetti. These outputs are then passed to the DRL model, which predicts optimal grasping parameters, including pixel coordinates, gripper width, and insertion depth. The predicted grasp pose is transformed from pixel to world coordinates, and the final grasp is executed via MoveIt's motion planning framework.
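As an illustration of how this pipeline could be assembled with Stable-Baselines3, the sketch below instantiates SAC on the custom environment; SpaghettiGraspEnv is a placeholder class name for the environment wrapping the camera, scale, and robot, and all hyperparameters shown are illustrative rather than the values used in the study.

```python
from stable_baselines3 import SAC

# Placeholder: the custom Gym environment that wraps the RGB-D camera, detection model,
# digital scale, and MoveIt-based robot execution described above.
env = SpaghettiGraspEnv(target_weight=50.0, tolerance=5.0)

model = SAC(
    "MultiInputPolicy",          # policy class for Dict observation spaces
    env,
    buffer_size=100_000,         # illustrative replay buffer size
    learning_rate=3e-4,          # illustrative
    gamma=0.99,
    verbose=1,
    # A custom ResNet-50 feature extractor can be plugged in via policy_kwargs
    # (see the extractor sketch in Section 4).
)
model.learn(total_timesteps=50_000)  # illustrative training budget
model.save("sac_spaghetti_grasp")
```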

4. Noodle-like Grasping Framework

Grasping delicate and deformable objects like spaghetti presents unique challenges compared to rigid-object grasping. The flexibility and entangled nature of the noodles introduce variability in shape, contact forces, and weight, which complicates reliable grasping. To address these challenges, we propose a noodle-like grasping framework that integrates a vision-based DRL grasping strategy into the policy network. Additionally, an augmentation technique is applied during training to increase model generalization and enhance the robot's ability to grasp and manipulate spaghetti.
At every timestep $t$, the process begins with an RGB-D camera detecting and segmenting spaghetti using the detection model trained in our lab, as mentioned in Section 3.3.2, to generate a mask image of the spaghetti, as illustrated in Figure 13. Depth information is used to represent the spaghetti strands, while RGB data also accounts for lighting conditions. Additionally, the tray weight (representing only the weight of the spaghetti) is measured, which is crucial since the spaghetti's weight may fluctuate over time due to dryness. This information is input into the DRL policy to generate actions $a$, as shown in Figure 14. Figure 15 illustrates how the policy $\pi$ takes the state described in Section 3.2 as input to generate the action $a$. We employ the SAC algorithm so that the robot learns the grasping policy by itself. We design a custom policy network tailored to high-dimensional visual inputs and physical properties. The policy processes the inputs by extracting features, fusing them, and then passing the fused features through a fully connected neural network (FCNN). We utilize the pretrained ResNet-50 as a backbone to extract complex features from the input data. To enhance the model's generalization capability, we applied data augmentation to both the visual data and the corresponding tray weight measurements. For the visual data, augmentations were designed to simulate varying lighting conditions and geometric transformations. For the weight data, Gaussian noise was added to mimic sensor variability and natural fluctuations in the measured weight, as illustrated in Figure 14. After the policy generates the action $a_t$ based on the observed state $s_t$, it outputs the pixel coordinates in the image ($p_x$ and $p_y$), the insertion depth $d_{insert}$ from the spaghetti surface at the selected pixel coordinate, and the gripper width $d_{gripper}$ used to grasp the spaghetti. Since the range of $a_t$ is constrained to $[-1, 1]$ by the tanh activation function in the actor network, these values must be mapped to the corresponding operational values of the actual hardware.
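The sketch below shows one way such a fused policy backbone could be written as a Stable-Baselines3 features extractor, assuming the dictionary observation of Section 3.2; only the RGB branch of the ResNet-50 is shown (a depth/mask branch would follow the same pattern), and the layer sizes are illustrative choices rather than the exact architecture of Figure 15.

```python
import torch
import torch.nn as nn
from torchvision import models
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class SpaghettiFeaturesExtractor(BaseFeaturesExtractor):
    """Fuses ResNet-50 image features with scalar weight inputs (illustrative sketch)."""

    def __init__(self, observation_space, features_dim=512):
        super().__init__(observation_space, features_dim)
        # Pretrained ResNet-50 backbone; the final classification layer is removed
        backbone = models.resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # outputs (B, 2048, 1, 1)
        # Small MLP for the two scalar weights (w_grasp, w_tray)
        self.weight_mlp = nn.Sequential(nn.Linear(2, 32), nn.ReLU())
        # Fusion layer mapping the concatenated features to the requested feature dimension
        self.fuse = nn.Sequential(nn.Linear(2048 + 32, features_dim), nn.ReLU())

    def forward(self, obs):
        # Assumed preprocessing: the RGB input is already a normalized float tensor (B, 3, H, W);
        # the depth and mask channels are omitted here for brevity.
        img_feat = self.cnn(obs["rgb"]).flatten(1)
        weights = torch.cat([obs["w_grasp"], obs["w_tray"]], dim=1)
        w_feat = self.weight_mlp(weights)
        return self.fuse(torch.cat([img_feat, w_feat], dim=1))
```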
To obtain the grasping pose $P_t$, the pixel coordinates $p_x$ and $p_y$ are mapped to the image resolution range to determine the 2D pixel coordinates $(i, j)$. This process incorporates the spaghetti mask from the detection model to ensure that the selected 2D coordinates do not fall in an ungraspable area (outside the mask region), as illustrated by the green dot in Figure 16. The corresponding 3D coordinates are then derived using the pinhole camera model, where the depth $Z_{i,j}$ is retrieved at the 2D pixel coordinates $(i, j)$ of the depth image, as defined in Equation (9).
$$Z = I_{depth}(i, j). \quad (9)$$
The camera's intrinsic and extrinsic parameters are applied to transform the 2D coordinates into 3D space, $P_t = [X, Y, Z]^{T}$, and to convert them into the robot base frame to ensure correct positioning for motion planning. To determine the actual grasping depth, the insertion depth $d_{insert}$ is defined relative to the surface depth at the grasping point $(X, Y)$:
$$Z_{scaled} = \frac{(d_{insert} + 1)}{2} \cdot 0.01 + (Z - 0.01). \quad (10)$$
Therefore, we interpolate the action range $[-1, 1]$ to a new range that accounts for a 1 cm insertion depth from the spaghetti surface. The final grasping pose is then $P_t = [X, Y, Z_{scaled}]^{T}$. The gripper width $G_t$ is obtained by interpolating from the range $[-1, 1]$ to the range required by the actual hardware.
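The following sketch illustrates this mapping from the tanh-squashed action to a grasp pose and gripper width; the camera intrinsics, image size, extrinsic transform, and gripper range are placeholder values that would come from calibration and the hardware specification, and snapping out-of-mask pixels to the nearest masked pixel is an illustrative strategy.

```python
import numpy as np

# Placeholder intrinsics (fx, fy, cx, cy), image size, and gripper range; real values
# come from camera calibration and the gripper hardware.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
IMG_W, IMG_H = 640, 480
GRIPPER_MIN, GRIPPER_MAX = 0.01, 0.08   # meters, assumed

def map_action(a, depth_image, mask, T_base_cam=np.eye(4)):
    """Map a tanh-squashed action a = (p_x, p_y, d_insert, d_gripper) in [-1, 1]^4 to a
    grasp pose in the robot base frame and a physical gripper width (illustrative sketch).
    mask is a boolean array marking pixels that belong to the detected spaghetti."""
    p_x, p_y, d_insert, d_gripper = a
    # Interpolate normalized pixel actions to image coordinates (i, j)
    j = int((p_x + 1.0) / 2.0 * (IMG_W - 1))
    i = int((p_y + 1.0) / 2.0 * (IMG_H - 1))
    # Keep the grasp point inside the spaghetti mask (here: snap to the nearest masked pixel)
    if not mask[i, j]:
        coords = np.argwhere(mask)
        i, j = coords[np.argmin(np.linalg.norm(coords - np.array([i, j]), axis=1))]
    # Pinhole back-projection with depth Z = I_depth(i, j), Equation (9)
    Z = float(depth_image[i, j])
    X = (j - CX) * Z / FX
    Y = (i - CY) * Z / FY
    # Transform from the camera frame to the robot base frame (placeholder extrinsics)
    p_base = (T_base_cam @ np.array([X, Y, Z, 1.0]))[:3]
    # Scale the insertion action to a 1 cm range below the surface, Equation (10)
    z_scaled = (d_insert + 1.0) / 2.0 * 0.01 + (p_base[2] - 0.01)
    # Interpolate the gripper action from [-1, 1] to the hardware width range
    width = (d_gripper + 1.0) / 2.0 * (GRIPPER_MAX - GRIPPER_MIN) + GRIPPER_MIN
    return np.array([p_base[0], p_base[1], z_scaled]), width
```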
After the action $a_t$ is mapped, the robot motion command assigns the grasping pose $P_t$ to the pose variable, which includes the position $(X, Y, Z_{scaled})$ and a quaternion orientation. The rotation is fixed at 90° to the surface to mimic human grasping behavior, as illustrated in Figure 8a. The robot then moves from the home position to the target pose and slowly grasps the spaghetti with the gripper width $G_t$. The robot next moves to a predefined joint configuration to measure the grasping weight $w_{grasp}$, as explained in Equations (5) and (6). The drop pose after grasping the spaghetti must also be considered; therefore, we developed an algorithm that randomly places the spaghetti back in the tray while accounting for the drop-pose history to prevent placements that are too close to previous locations. The grasping weight $w_{grasp}$ is returned and used to compute the reward $r_t$ as defined in Equation (7). The robot returns to the home position and observes the state $s_{t+1}$ after interacting with the environment. The agent then evaluates whether the step should terminate: if the grasping weight $w_{grasp}$ falls within the tolerance, the episode is marked as complete (done = true); otherwise, done remains false, and the agent continues execution until the termination condition is met. After each step, the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer.
The policy $\pi$ is updated by randomly sampling transitions from the replay buffer; in each gradient step, a batch is sampled. The critic network $Q$, parameterized by $\theta$, is updated by minimizing the mean squared error (MSE) between the predicted Q-value $Q_\theta(s_t, a_t)$ for transitions sampled from the replay buffer and the target Q-value. Two Q-functions, $Q_{\theta_1}$ and $Q_{\theta_2}$, are used to mitigate overestimation bias.
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[ \frac{1}{2} \left( Q_\theta(s_t, a_t) - \left( r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim \rho}\left[ \hat{Q}(s_{t+1}, a) - \alpha \log \pi(a \mid s_{t+1}) \right] \right) \right)^{2} \right], \quad (11)$$
where $\hat{Q}(s_{t+1}, a)$ is the target Q-value, $\gamma$ is the discount factor, and $\alpha$ is the temperature parameter that controls the entropy weight. The policy network $\pi$, parameterized by $\phi$, is updated by minimizing the following objective function:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right]. \quad (12)$$
The policy is trained to maximize the expected Q-value while encouraging exploration through entropy regularization. The temperature parameter $\alpha$ is updated to maintain a target entropy $\mathcal{H}_{target}$, which ensures a balance between exploration and exploitation:
$$L_\alpha = \mathbb{E}_{a_t \sim \pi}\left[ -\alpha \left( \log \pi(a_t \mid s_t) + \mathcal{H}_{target} \right) \right]. \quad (13)$$
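For clarity, the schematic sketch below expresses the three loss terms of Equations (11)–(13) in PyTorch-style code; the Q-networks and policy are assumed to be callables (with policy.sample returning an action and its log-probability), the target entropy of −4 corresponds to the 4-dimensional action space, and in practice Stable-Baselines3 performs these updates internally.

```python
import torch

def sac_losses(q1, q2, q1_target, q2_target, policy, batch, alpha,
               gamma=0.99, target_entropy=-4.0):
    """Schematic SAC losses; q*/policy are assumed callables, batch holds replay-buffer tensors."""
    s, a, r, s_next = batch
    with torch.no_grad():
        # Target Q-value with entropy bonus, as in Equation (11)
        a_next, logp_next = policy.sample(s_next)
        q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)
    # Critic loss: MSE between predicted Q-values and the target y (Equation (11))
    critic_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    # Actor loss: maximize the soft Q-value with entropy regularization (Equation (12))
    a_new, logp = policy.sample(s)
    actor_loss = (alpha * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    # Temperature loss: keep policy entropy near the target entropy (Equation (13))
    alpha_loss = -(alpha * (logp + target_entropy).detach()).mean()
    return critic_loss, actor_loss, alpha_loss
```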
The self-learning algorithm for spaghetti grasping utilizes the SAC framework to optimize robotic grasping performance. It initializes key parameters and resets the robot before each episode. At each timestep t, the robot observes its state s t , preprocesses data, selects an action a t , and executes the grasp. The grasping outcome is evaluated through a reward function r t , the next state s t + 1 is observed, and experiences are stored in a replay buffer. The model updates periodically, refining the policy to improve grasping accuracy over time. At each timestep t, the robot motion is executed as illustrated in Figure 17.

5. Experimental Setup and Results

5.1. Experiment Setup

The experimental setup is designed to support multiple parameter configurations. The environment consists of the robotic system illustrated in Figure 10. To enable effective training for grasping noodle-like food items, the robot executes the action $a_t$, generated by the policy $\pi$, after observing the current state $s_t$. A random drop-pose algorithm, $P_{drop}$, is developed to create diverse object placements and avoid repeated positions, with the drop-pose history defined as $H_{drop} = \{p_1, p_2, p_3, \ldots, p_n\}$. At each timestep $t$, a new drop pose $p_{new}$ is randomly generated within the food tray area such that
$$\| p_{new} - p_i \|_2 \geq \delta, \quad \forall p_i \in H_{drop}, \quad (14)$$
where $\delta$ is a distance threshold that ensures the new drop pose is not placed too close to any previous pose in $H_{drop}$.
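A minimal sketch of such a rejection-sampling drop-pose generator is shown below; the tray bounds and the threshold δ are placeholder values.

```python
import random
import math

TRAY_X = (0.30, 0.55)   # placeholder tray bounds in the robot base frame (meters)
TRAY_Y = (-0.15, 0.15)
DELTA = 0.05            # placeholder minimum distance to previous drop poses (meters)

def sample_drop_pose(history, max_tries=100):
    """Rejection-sample a drop pose inside the tray at least DELTA away from previous poses."""
    for _ in range(max_tries):
        candidate = (random.uniform(*TRAY_X), random.uniform(*TRAY_Y))
        if all(math.dist(candidate, p) >= DELTA for p in history):
            history.append(candidate)
            return candidate
    # Fallback: accept the last candidate if the tray is crowded with previous poses
    history.append(candidate)
    return candidate

drop_history = []
pose = sample_drop_pose(drop_history)
```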
Raw spaghetti is gathered into batches of approximately 300 g and placed in a container filled with water. It is then heated for 10 minutes. After cooking, the water is drained from the container, and the spaghetti is left to dry for approximately 10 minutes. A small amount of water and oil is added after the drying process to facilitate grasping and reduce entanglement of the spaghetti strands. The cooked spaghetti is then placed in the food tray on a scale.
The configuration is set up for training the model and is loaded into both the agent and the environment. The process begins with an environment reset, which includes returning the robot to its home position and zeroing the gripper. At each timestep $t$, the agent observes the environment and obtains the state $s_t$. The policy $\pi$ maps the state $s_t$ to an action $a_t$, as illustrated in Figure 15. The output action $a_t$ is then scaled to match the required range of the real hardware. The resulting grasp pose $P_t$ and gripper width $G_t$ are sent as motion commands to the robot. The system calculates the reward based on Equation (7), using the grasped weight $w_{grasp}$. Episode termination is checked after each action; the episode ends if the action $a_t$ results in a grasp within the specified tolerance. Otherwise, the timestep is incremented ($t + 1$), and the process repeats until the tolerance condition is met. The agent then observes the next state $s_{t+1}$, and the transition tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer; this process is depicted in Figure 18. During policy updates, transitions are randomly sampled from the replay buffer to update the model, as written in Algorithm 1.
Algorithm 1 Self-learning Algorithm for Spaghetti Grasping
1:  Initialize task and SAC parameters (e.g., target weight, gripper range, reward scale, network structure)
2:  Reset robot arm and zero stepper motor
3:  for episode = 1 to N do
4:      Initialize environment and agent
5:      for timestep t = 0 to T do
6:          Observe state s_t and preprocess data
7:          if augmentation enabled then
8:              Apply image and physical data augmentation
9:          end if
10:         Extract features and generate action a_t ~ π(s_t)
11:         Map a_t to grasp pose P_t and gripper width G_t
12:         Execute a_t; obtain reward r_t and next state s_{t+1}
13:         Store (s_t, a_t, r_t, s_{t+1}) in replay buffer
14:         if t % 99 == 0 then
15:             for each gradient step do
16:                 Sample from buffer and update SAC:
17:                     Update critic Q_θ (Equation (11))
18:                     Update actor π (Equation (12))
19:                     Update α (Equation (13)) and target critic
20:             end for
21:         end if
22:     end for
23: end for

5.2. Results

To validate the effectiveness of the learned policy, the experiment was performed in a real-world scenario using the robotic system described above. The training process was conducted over multiple episodes, followed by evaluation trials to assess grasping performance on noodle-like objects. We trained two models: SAC with data augmentation during training, and a baseline SAC model. The first model uses the custom policy architecture illustrated in Figure 15 to obtain $P_t$ and $G_t$. Each model was trained for 5000 episodes using the reward function defined in Equation (7). The target weight for the policy to learn was set to 50 g of boiled spaghetti. During the training phase, the spaghetti was replaced every 1000 grasp attempts to maintain its quality, as prolonged use can lead to weight loss and increased susceptibility to damage. The agent's performance during training is evaluated using learning curves. Since rewards can be noisy, we apply smoothing and normalization to constrain the reward values within a range, enabling clearer monitoring throughout the training process, as shown in Figure 19. The SAC model with data augmentation shows better performance, whereas the baseline model exhibits noticeably oscillatory behavior during training. The proposed model demonstrates strong performance for the noodle-like grasping task, as Figure 20 shows that the loss decreases over time. However, handling real spaghetti presents challenges due to fluctuations in its weight. The entropy coefficient of the proposed model stabilizes at approximately 0.6386, indicating a balanced trade-off between exploration and exploitation. This value suggests that the policy maintains sufficient exploration while avoiding excessive randomness.
In the evaluation phase, each trained model was tested over two trials, with 100 grasping attempts performed per trial. For each trial, we recorded the average grasped weight, the sample standard deviation of the grasped weights, and the percent error, which quantifies the deviation from the target weight as a percentage, as shown in Table 3. The proposed model achieved strong performance in both trials, demonstrating higher accuracy in grasped weight than the baseline model, which showed lower consistency and precision. To evaluate the generalizability of the proposed method, we conducted additional experiments with different types of Japanese noodles, performing 50 grasping attempts for each type. Champon features thicker and larger strands than spaghetti; Udon strands are larger than those of spaghetti but thinner than those of Champon; and Soba has the thinnest strands among the tested noodles and the lowest overall weight.
The same policy trained using SAC with data augmentation was tested on Champon, Udon, and Soba noodles to assess generalization. The results obtained from these experiments demonstrate that the trained agent was able to perform grasping on Champon and Udon noodles; however, it failed to adapt its policy effectively for Soba noodles. Table 4 presents the average grasped weight, standard deviation, and percentage error for each noodle type.

6. Conclusions and Discussion

In this study, we presented a self-learning system for the task of grasping spaghetti designed to manipulate noodles without causing damage while achieving a specified target weight. Our system employs a robotic arm that mimics human-like motion, along with a custom-designed gripper specifically engineered to handle noodles delicately. We leverage deep reinforcement learning (DRL), using the Soft Actor-Critic (SAC) algorithm to enable autonomous learning. To enhance generalization and learning efficiency, we proposed a custom policy architecture incorporating data augmentation, which augments both visual and physical inputs, allowing the agent to experience a diverse set of scenarios during policy updates. The experiment focused on grasping spaghetti noodles to a specified weight of 50 g. The proposed model can achieve more stable learning curves and better accuracy compared to a baseline. Furthermore, we evaluated our model using different types of Japanese noodles—Champon, Udon, and Soba—to assess its generalization capabilities. The results demonstrate that the model was able to adapt to variations in noodle characteristics for both Champon and Udon. However, it failed to perform effectively with Soba, likely due to its lower overall weight and finer structural details, which made it more challenging for the agent to predict appropriate grasping actions. Nevertheless, these findings highlight the adaptability of the learned policy, even when applied to noodle types not encountered during training.
However, challenges remain in handling deformable objects in the real world, such as noodles. Variability in moisture content, strand thickness, and breakage over time introduce noise and inconsistency, particularly during long training sessions. Additionally, real spaghetti tends to degrade with repeated use, affecting both the weight and physical properties of the grasped object. These factors contribute to reward noise and variance in the evaluation.

7. Future Work

Future research can explore several directions to enhance the robustness of the spaghetti grasping task and to further enable the agent to make accurate predictions under unseen observations or when manipulating different types of noodles:
  • Training an agent for the spaghetti grasping task is challenging and time-consuming. When introducing a new set of configuration parameters, the agent must typically learn from scratch, which can be inefficient. To address this limitation, offline deep reinforcement learning (offline DRL) can be employed to provide the agent with a foundational understanding of noodle-like grasping environments. By leveraging a pretrained policy obtained through offline DRL, the learning process in new environments can be significantly accelerated, resulting in faster convergence and improved learning curves.
  • Another promising direction for future work is the integration of auxiliary learning to enhance grasp performance. Specifically, an auxiliary task could be introduced to predict grasp success probability alongside the main reinforcement learning objective. This parallel learning objective would allow the network to learn richer representations, improve sample efficiency, and provide an additional confidence signal during both training and inference. Such auxiliary prediction could also be used to guide action selection or filter suboptimal grasps, thereby improving the overall robustness and reliability of the system.
  • Sim-to-real transfer can also reduce the reliance on extensive real-world data collection. Training robotic agents entirely in simulation offers faster iteration and safer exploration, but discrepancies between simulated and real environments—commonly referred to as the reality gap—pose significant challenges. To address this, techniques such as domain randomization, domain adaptation, and fine-tuning with a small amount of real data can be explored. Incorporating more realistic physics and visual textures in simulation, as well as leveraging pretrained perception modules, could further narrow the gap. Ultimately, a robust sim-to-real strategy would enable more scalable and cost-effective deployment of robotic grasping systems in real-world applications.

Author Contributions

Conceptualization, P.G. and E.H.; methodology, P.G.; software, P.G. and N.K.; validation, P.G. and E.H.; formal analysis, P.G.; investigation, P.G. and E.H.; resources, E.H.; data curation, P.G.; writing—original draft preparation, P.G. and E.H.; writing—review and editing, A.M. and E.H.; visualization, P.G.; supervision, E.H.; project administration, E.H.; funding acquisition, E.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRL    Deep reinforcement learning
RL     Reinforcement learning
DCNN   Deep convolutional neural network
TD     Temporal difference
DNN    Deep neural network
YOLO   You Only Look Once
MDP    Markov decision process
KF     Kalman filter
DoF    Degree of freedom
ROS    Robot Operating System
SAC    Soft Actor-Critic
FCNN   Fully connected neural network

References

  1. Kaur, N.; Sharma, A. Robotics and Automation in Manufacturing Processes. In Intelligent Manufacturing; CRC Press: Boca Raton, FL, USA, 2025; pp. 97–109. [Google Scholar]
  2. Dzedzickis, A.; Subačiūtė-Žemaitienė, J.; Šutinys, E.; Samukaitė-Bubnienė, U.; Bučinskas, V. Advanced applications of industrial robotics: New trends and possibilities. Appl. Sci. 2021, 12, 135. [Google Scholar] [CrossRef]
  3. Ono, K.; Hayashi, T.; Fujii, M.; Shibasaki, N.; Sonehara, M. Development for industrial robotics applications. IHI Eng. Rev. 2009, 42, 103–107. [Google Scholar]
  4. de Guzman, P. How a Train Bento Box is Made in Japan. 2021. Available online: https://www.youtube.com/watch?v=eBPsaa0_RtQ&t=348s (accessed on 20 May 2025).
  5. Ummadisingu, A.; Takahashi, K.; Fukaya, N. Cluttered food grasping with adaptive fingers and synthetic-data trained object detection. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 8290–8297. [Google Scholar]
  6. Endo, G.; Otomo, N. Development of a Food Handling Gripper Considering an Appetizing. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
  7. Chen, Y.L.; Cai, Y.R.; Cheng, M.Y. Vision-based robotic object grasping—a deep reinforcement learning approach. Machines 2023, 11, 275. [Google Scholar] [CrossRef]
  8. Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734. [Google Scholar] [CrossRef]
  9. Dewi, T.; Risma, P.; Oktarina, Y. Fruit sorting robot based on color and size for an agricultural product packaging system. Bull. Electr. Eng. Inform. 2020, 9, 1438–1445. [Google Scholar] [CrossRef]
  10. Wang, Z.; Makiyama, Y.; Hirai, S. A soft needle gripper capable of grasping and piercing for handling food materials. J. Robot. Mechatronics 2021, 33, 935–943. [Google Scholar] [CrossRef]
  11. Kazarian, K. Robots Reach for Food Processing. Available online: https://www.foodengineeringmag.com/articles/100208-robots-reach-for-food-processing (accessed on 18 April 2022).
  12. Sutton, R.S. Reinforcement Learning: An Introduction; A Bradford Book; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  13. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  14. Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Pick and place objects in a cluttered scene using deep reinforcement learning. Int. J. Mech. Mechatron. Eng. 2020, 20, 50–57. [Google Scholar]
  15. Guillaume, A. rl-taxi: Reinforcement Learning for Taxi Cab v3. 2023. Available online: https://github.com/gandroz/rl-taxi (accessed on 20 May 2025).
  16. de Lange, E. Escape from a Maze Using Reinforcement Learning. 2022. Available online: https://github.com/erikdelange/Reinforcement-Learning-Maze (accessed on 20 May 2025).
  17. Prasenjit, K. GridWorld: Gridworld Environment Creator for Testing RL Algorithms. 2024. Available online: https://github.com/prasenjit52282/GridWorld (accessed on 20 May 2025).
  18. Low, J.H.; Khin, P.M.; Han, Q.Q.; Yao, H.; Teoh, Y.S.; Zeng, Y.; Li, S.; Liu, J.; Liu, Z.; y Alvarado, P.V.; et al. Sensorized reconfigurable soft robotic gripper system for automated food handling. IEEE/ASME Trans. Mechatron. 2021, 27, 3232–3243. [Google Scholar] [CrossRef]
  19. Franco, L.; Turco, E.; Bo, V.; Pozzi, M.; Malvezzi, M.; Prattichizzo, D.; Salvietti, G. The double-scoop gripper: A tendon-driven soft-rigid end-effector for food handling exploiting constraints in narrow spaces. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4170–4176. [Google Scholar]
  20. Wang, Z.; Furuta, H.; Hirai, S.; Kawamura, S. A scooping-binding robotic gripper for handling various food products. Front. Robot. AI 2021, 8, 640805. [Google Scholar] [CrossRef] [PubMed]
  21. Tulapornpipat, W. Picking-and-Placing Food with Vacuum Pad and Soft Gripper for Industrial Robot Arm. Master’s Thesis, Chulalongkorn University, Bangkok, Thailand, 2020. [Google Scholar]
  22. Welch, G.; Bishop, G. An introduction to the kalman filter. In Proceedings of the SIGGRAPH, Course, Los Angeles, CA, USA, 12–17 August 2001; Volume 8, p. 41. [Google Scholar]
  23. Ahmed, M.; Qari, K.; Kumar, R.; Lall, B.; Kherani, A. Vision-based human wrist localization and with Kalman-filter backed stabilization for Bilateral Teleoperation of Robotic Arm. Procedia Comput. Sci. 2024, 235, 264–273. [Google Scholar] [CrossRef]
  24. Sun, R.; Wu, C.; Zhao, X.; Zhao, B.; Jiang, Y. Object Recognition and Grasping for Collaborative Robots Based on Vision. Sensors 2023, 24, 195. [Google Scholar] [CrossRef] [PubMed]
  25. Prasad, S. Application of robotics in dairy and food industries: A review. Int. J. Sci. Environ. Technol. 2017, 6, 1856–1864. [Google Scholar]
26. Candemir, A.; Can, F.C. Pick & Place Task Implementation of a Scara Manipulator via Robot Operating System and Machine Vision. Available online: https://www.kalaharijournals.com/resources/JUNE-49.pdf (accessed on 20 May 2025).
  27. Wang, Z.; Hirai, S.; Kawamura, S. Challenges and opportunities in robotic food handling: A review. Front. Robot. AI 2022, 8, 789107. [Google Scholar] [CrossRef] [PubMed]
  28. Urrea, C.; Kern, J. Recent Advances and Challenges in Industrial Robotics: A Systematic Review of Technological Trends and Emerging Applications. Processes 2025, 13, 832. [Google Scholar] [CrossRef]
  29. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  30. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4974–4983. [Google Scholar]
Figure 1. (a) illustrates a typical lunch box assembly process in a large-scale company, where food items are manually placed into lunch boxes along the production line (image from [4]). In contrast, (b) depicts a scenario where specific food items require precise portioning. A scale is used to determine the correct amount of food before it is assembled into the lunch box.
Figure 2. The robot arm moves stacked pallets from the production line to a nearby shelf [11].
Figure 3. Agent–environment interaction based on the Markov decision process (MDP).
Figure 4. Simple tasks that can be solved by classic RL algorithms: Taxi-v3 [15], Escape from a Maze [16], Gridworld navigation [17], and Tic-tac-toe, respectively. In Taxi-v3, the letters in the environment (R, G, Y, B) indicate possible passenger pick-up and drop-off locations.
Figure 5. (A) Overview of our robot system designed to autonomously grasp Japanese foods for a lunch box using the Yaskawa Motoman robot arm. (B) Dual-tool gripper designed to handle multiple food types without requiring tool changes. This gripper was developed in the lab as part of this study [21], is capable of grasping non-rigid food, and includes a vacuum tool pad for handling flat-shaped food. (C) End tool designed for grasping noodle-like foods, such as spaghetti. The small photo below illustrates the extended tool for grasping larger quantities of food. (D) Gripper tool utilizing a stepper motor for precise positional control to grasp larger objects.
Figure 6. This figure illustrates the sequential steps in the pick-and-place task for food handling developed in our laboratory. (A) Detect the food object, determine the pose for the robot to grasp, and identify where to place the food after a successful grasp. (B) The robot starts moving to the target goal. (C) The robot reaches the target goal and grasps the food. (D) The robot picks up the food. (E,F) The robot moves to the detected compartment of the lunch box. (G) The robot places the grasped object in the lunch box. (H,I) The robot moves back to the home position. (J) The robot reaches the home position and is ready to grasp the next object.
Figure 7. This figure displays the eye-in-hand view of the robotic system: (A) the raw RGB image captured by the system, and (B) the detection and segmentation of Japanese food items. The objects depicted in the image consist of lunch boxes, spaghetti, fried chicken, ham, and onigiri.
Figure 8. (a) Illustration of the manual spaghetti picking process from a tray to a lunch box on a moving conveyor, performed by a human operator: (A) selecting the target compartment; (B) observing and grasping the spaghetti; (C) preparing the spaghetti and deciding the placement location; (D,E) placing the spaghetti into the compartment; and (F) task completion. If the placement is unsuccessful, the spaghetti is returned to the tray and the process is repeated. (b) Flowchart of the decision making performed during this task.
Figure 9. Problem formulation of this study based on a Markov decision process (MDP).
Figure 10. This figure illustrates: (A) a 7-DoF articulated robot arm, the Yaskawa Motoman SIA5F; (B) the gripper tool designed for grasping noodle-like food items; and (C) the side view of the gripper tool, showing the mounted RGB-D camera, force-torque sensor (not utilized in this study), and finger components.
Figure 11. This figure illustrates the designed gripper grasping spaghetti. The slightly curved structure of the gripper enables it to hold the noodles securely.
Figure 12. This figure shows the hardware connections in the system.
Figure 13. The robot system visualization in RViz displays the robot grasping spaghetti along with image data from the camera, including RGB, depth, and mask images.
Figure 14. The feature extractor is designed to generate the action a t for grasping the noodles. It utilizes a pretrained ResNet-50 to extract visual features, which are then passed through custom linear layers, along with the grasping weight and the weight of the spaghetti in the tray, to be input into the SAC network.
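To make the Figure 14 architecture concrete, the following PyTorch-style sketch shows one way the pretrained ResNet-50 features could be fused with the grasping weight and tray weight before being passed to the SAC network. Layer sizes, module names, and the two-scalar weight input are illustrative assumptions, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

class GraspFeatureExtractor(nn.Module):
    """Sketch of the Figure 14 extractor: ResNet-50 visual features fused with
    the current grasped weight and the remaining tray weight. Dimensions are
    illustrative assumptions."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.visual_head = nn.Sequential(    # custom linear layers after ResNet-50
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, feature_dim), nn.ReLU(),
        )
        self.weight_head = nn.Sequential(    # grasped weight + tray weight (2 scalars)
            nn.Linear(2, 32), nn.ReLU(),
        )
        self.fusion = nn.Linear(feature_dim + 32, feature_dim)

    def forward(self, image: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        v = self.visual_head(self.backbone(image))   # (B, feature_dim)
        w = self.weight_head(weights)                 # (B, 32)
        return torch.relu(self.fusion(torch.cat([v, w], dim=-1)))
```

In a Stable-Baselines3-style setup, a module like this would typically be registered as a custom features extractor so that the SAC actor and critic share the fused representation; that wiring is likewise an assumption here rather than a description of the authors' pipeline.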
Figure 15. The proposed custom policy function maps the state S t to the action A t , enabling the robot to learn self-grasping. This policy is designed to process high-dimensional visual inputs and physical properties, extracting and fusing relevant features before generating grasping actions.
Figure 16. Conversion of output action to 2D coordinates. The action outputs p x and p y are mapped to image resolution, generating 2D coordinates ( i , j ) . The spaghetti marker ensures valid graspable regions, while depth Z is extracted and transformed into the robot base frame for motion planning.
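A minimal sketch of the Figure 16 conversion, assuming the policy outputs normalized coordinates in [-1, 1], a metric depth image, a binary spaghetti mask, pinhole intrinsics K, and a known camera-to-base transform; the helper name and the nearest-valid-pixel fallback are illustrative, not the authors' exact procedure.

```python
import numpy as np

def action_to_grasp_point(action_xy, depth_img, mask, K, T_base_cam):
    """Map a normalized action (p_x, p_y) to a 3D grasp point in the robot
    base frame, using the depth image and the spaghetti mask (Figure 16)."""
    h, w = depth_img.shape
    # Map normalized action values to pixel coordinates (i, j).
    j = int((action_xy[0] + 1.0) / 2.0 * (w - 1))
    i = int((action_xy[1] + 1.0) / 2.0 * (h - 1))
    # Snap to the nearest pixel inside the spaghetti mask so the grasp is valid.
    if not mask[i, j]:
        ys, xs = np.nonzero(mask)
        k = np.argmin((ys - i) ** 2 + (xs - j) ** 2)
        i, j = ys[k], xs[k]
    z = float(depth_img[i, j])
    # Back-project the pixel to a 3D point in the camera frame.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(j - cx) * z / fx, (i - cy) * z / fy, z, 1.0])
    # Transform into the robot base frame for motion planning.
    return (T_base_cam @ p_cam)[:3]
```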
Figure 17. This figure illustrates the robot’s motion execution during training. (A) The robot and gripper are reset to the home position. (B) The robot moves to the target pose P t . (C) The robot reaches the target pose. (D) The robot grasps the spaghetti according to G t . (E) The robot gently lifts from the grasp pose. (F) The robot moves to a predefined joint position to measure the grasping weight. (GI) The robot moves to a random drop pose. (J) The agent releases the grasped spaghetti. (K) The robot returns to the home position. (L) The agent observes the environment and updates the policy.
Figure 18. This figure illustrates the training diagram based on the MDP.
Figure 19. Comparison of the learning curves of the model utilizing data augmentation and the baseline model, both trained using the SAC algorithm.
Figure 20. Training metrics over 5000 training episodes for both models: (A) actor loss, (B) critic loss, (C) entropy coefficient, and (D) entropy coefficient loss.
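For context on Figure 20C,D: the entropy coefficient is the SAC temperature α. In the formulation of [29] it is a fixed hyperparameter, while the presence of an entropy-coefficient loss curve suggests automatic temperature tuning, in which α is learned by minimizing

\[ J(\alpha) = \mathbb{E}_{a_t \sim \pi}\!\left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right], \]

where \( \bar{\mathcal{H}} \) is a target entropy, commonly set to the negative of the action dimensionality.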
Table 1. Comparison of different grippers developed for Japanese food handling.

| Gripper Type | Design | Actuation Method | Suitable Food Types | Features |
|---|---|---|---|---|
| Dual-Scoop Gripper | Scoop-shaped concave jaws | Pneumatic | Fried chicken, Onigiri | High adaptability to shape |
| Parallel Gripper | Finray four-finger design | Stepping Motor | Sausage-like | Strong grip for firm items |
| Suction Gripper | Vacuum suction | Pneumatic | Flat or sealed surfaces | Minimal contact, hygienic |
Table 2. This table shows the PC specifications utilized in this study.

| Item | Value |
|---|---|
| Operating System | Linux Ubuntu 20.04 LTS |
| Processor | 12th Gen Intel® Core™ i9-12900K × 24 |
| Memory | 64 GB |
| Graphics Card | Nvidia GeForce RTX 3090 (24 GB) |
| Disk Capacity | 4 TB |
Table 3. This table presents the results of the two trained models.

| Model | Trial | Average Grasped Weight (g) | Standard Deviation (g) | Percent Error (%) |
|---|---|---|---|---|
| SAC + Augmentation | First | 50.2408 | 5.6022 | 0.4816 |
| SAC + Augmentation | Second | 48.1213 | 5.2384 | 3.7573 |
| SAC | First | 45.0747 | 14.0797 | 9.8504 |
| SAC | Second | 42.1971 | 15.7234 | 15.6056 |
Table 4. This table presents the results of the trained model deployed on Champon, Udon, and Soba noodles.

| Noodles | Average Grasped Weight (g) | Standard Deviation (g) | Percent Error (%) |
|---|---|---|---|
| Udon | 50.2472 | 5.8464 | 0.4944 |
| Champon | 48.7638 | 5.9325 | 2.4723 |
| Soba | 36.2116 | 0.932 | 27.576 |
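The percent errors reported in Tables 3 and 4 are consistent with a nominal 50 g grasp target; this target value is inferred from the reported figures rather than restated in the tables. For example, for the first trial of the SAC + Augmentation model:

\[ \text{Percent Error} = \frac{\lvert \bar{w} - w_{\text{target}} \rvert}{w_{\text{target}}} \times 100 = \frac{\lvert 50.2408 - 50 \rvert}{50} \times 100 \approx 0.4816\%. \]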
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
