Article

Contact-Aware Diffusion Sampling for RRT-Based Manipulation

Department of Information and Telecommunication Engineering, Incheon National University, Incheon 22012, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4837; https://doi.org/10.3390/electronics14244837
Submission received: 10 November 2025 / Revised: 4 December 2025 / Accepted: 5 December 2025 / Published: 8 December 2025
(This article belongs to the Special Issue Intelligent Perception and Control for Robotics)

Abstract

Rapidly exploring Random Trees (RRTs) provide probabilistic completeness but often explore inefficiently in high-DOF manipulation tasks. We address this by proposing a contact-aware, two-level planner that couples a learned toggle–subgoal predictor with a conditional diffusion sampler in joint space under a completeness-preserving mixture with uniform sampling. An upper ResNet-based network predicts task-relevant milestones from RGB images: grasp/release “toggle” configurations and intermediate joint-space subgoals that serve as phase-wise, receding-horizon targets between consecutive contact events. Conditioned on these predictions and the current state, a lower-level diffusion model samples tree-extension segments—joint-space directions and step lengths—instead of absolute configurations. These proposals act as a drop-in replacement for uniform sampling in standard RRT/RRT-Connect, while a nonzero fraction of uniform samples preserves probabilistic completeness. By biasing growth toward contact-relevant regions, the planner concentrates the search near feasible approach manifolds without altering nearest-neighbor, steering, or collision-checking primitives. In mug pick-and-place simulations, the proposed method achieves higher success rates than diffusion and other sequence-based policies trained by imitation learning, and requires fewer RRT expansions than uniform and goal-biased RRT as well as prior learning-guided samplers based on CVAE and conditional GAN, under identical collision checking and iteration limits.

1. Introduction

Sampling-based motion planning (SBMP) is a standard approach for high-dimensional and dynamically constrained planning. Rather than explicitly discretizing the state space, SBMP draws candidate states in the feasible set and connects them with local steering while delegating feasibility to collision checking. Classical sampling strategies (random or low-discrepancy) provide broad coverage so that, as the number of samples increases, the planner attains probabilistic completeness and, for suitable variants, asymptotic optimality [1,2,3].
Within SBMP, RRT and its descendants (e.g., RRT-Connect, RRT*) are widely used in both manipulation and navigation. In manipulation, they are combined with task-space goals, kinematic limits, and collision checking to plan whole-arm motions in clutter (e.g., shelf insertion, bin picking, peg-in-hole). In navigation and autonomous driving, RRT-style planners handle kinodynamic constraints and long-horizon obstacle avoidance. These uses benefit from modular components (nearest-neighbor search, steering, and collision checks), but the core sampling mechanism usually remains uniform or only weakly informed.
In practice, feasible motions concentrate in a small, structured subset of the state space due to the environment (start/goal placement, clutter, narrow passages), the system (e.g., stability regions), and implicit constraints (e.g., loop closures, multi-robot separation). Uniform sampling eventually explores such regions, but only through exhaustive trial and error, limiting efficiency and the ability to reuse prior knowledge.
Recent progress in robot learning shows that imitation learning (IL) is effective for visuomotor control—especially in manipulation—by training policies from demonstrations. In parallel, diffusion policies model action sequences or low-level controls via denoising diffusion, enabling multi-modal behaviors from demonstrations. Despite these advances, purely end-to-end controllers often degrade under covariate shift, sensing noise, or contact discontinuities and typically lack an explicit search component. A complementary direction is to retain a search-based planner while learning informative sampling distributions or heuristics, so that the planner preserves coverage and learning provides bias.
A body of work explores learning to guide SBMP by shaping sampling distributions, estimating heuristics, or proposing local connections (e.g., learned samplers for RRT/RRT*, neural encoders that map scenes to promising states, or networks that infer steering targets) [4,5]. While these methods accelerate planning, two gaps remain that are especially pronounced in manipulation: (i) capturing the multi-modal structure of promising joint-space regions (e.g., multiple IK branches, distinct approach corridors), and (ii) incorporating contact semantics (grasp/release timing) so that samples align with task events rather than only geometric proximity.
We propose a contact-aware SBMP framework for manipulation that combines (i) a learned toggle–subgoal predictor and (ii) a conditional diffusion sampler operating directly in joint space. The upper network takes a scene RGB image together with the current joint configuration and gripper mode and predicts (a) a toggle configuration at which the gripper should switch between open and close, (b) the corresponding toggle mode, and (c) an intermediate joint-space subgoal that serves as a phase-wise, receding-horizon target for the planner: for each phase, we plan from the current configuration to the predicted subgoal, execute the resulting motion (toggling the gripper near the toggle configuration), and then re-query the upper network at the new state. Conditioned on the image and these predictions, along with the current robot state, the lower-level diffusion model proposes tree-extension segments—a joint-space direction and a positive step length—that replace and augment the sampling step in RRT/RRT-Connect. Proposals are drawn from a mixture of the learned sampler and a uniform policy; by maintaining a nonzero fraction of uniform proposals, the planner preserves probabilistic completeness while biasing growth toward contact- and task-relevant regions. Nearest-neighbor search, steering, and collision checking remain unchanged.
We make the following contributions:
  • Toggle–subgoal prediction for contact-aware planning. We introduce a ResNet-based predictor that maps a single RGB image and the current robot state (joints and gripper mode) to a toggle configuration, its toggle mode, and an intermediate joint-space subgoal. This exposes grasp/release timing and provides task-aligned guidance for where the RRT tree should expand within each short planning phase on the way to the next contact event.
  • Diffusion over joint-space extension segments. We learn a conditional Denoising Diffusion Probabilistic Model (DDPM) [6] that samples joint-space directions and step lengths for RRT tree extensions instead of absolute configurations. The sampler is conditioned on the scene image, current configuration, and predicted toggle/subgoal, and plugs into RRT/RRT-Connect without modifying nearest-neighbor search, steering, or collision checking.
  • Mixture-based completeness with receding-horizon targets. We mix diffusion-based segment proposals with a nonzero fraction of uniform segments, preserving probabilistic completeness while biasing the search toward contact- and task-relevant regions. Using the predicted subgoal as the current target in each phase yields a receding-horizon planner that aligns local tree growth with task progress.
  • Simulation evidence on contact-rich mug pick-and-place. For a contact-rich mug pick-and-place benchmark with multiple grasp/release events, our planner achieves higher success rates than imitation-only baselines and requires fewer RRT expansions than uniform/goal-biased RRT and prior learned samplers under identical collision checking and iteration limits.

2. Related Work

Beyond classical RRT/RRT* baselines, a large body of work focuses on where to sample to accelerate the search. Informed RRT* restricts sampling to an ellipsoidal subset that can still improve the current solution, yielding large speedups when high-quality paths exist [7]. Numerous RRT* variants pursue constrained or hybrid sampling to emphasize promising regions while preserving coverage properties. These methods handcraft informative regions, whereas we learn them from data and inject contact semantics via predicted toggle states.
Learning-based guidance replaces uniform sampling with data-driven priors or proposes local connections. Ichter et al. learn sampling distributions conditioned on a problem context and inject them into RRT, demonstrating substantial speedups [4]. MPNet encodes workspace observations and predicts states that guide a planner or produce end-to-end paths [5]. More recently, Neural-Informed RRT* learns to predict informed regions for RRT*, improving convergence by biasing samples toward promising subsets of the state space [8]. Generative models such as CVAEs and conditional GANs have also been used to learn sampling distributions for RRT-style planners (e.g., in the spirit of [9,10]), and we adopt CVAE- and conditional-GAN–based samplers as baselines in our experiments. Our approach shares the goal of learned guidance but differs by (i) operating directly in joint space with a diffusion model over tree-extension segments rather than absolute states, and (ii) conditioning on a predicted toggle configuration to inject contact semantics into the sampling process.
Diffusion has proven effective for multi-modal visuomotor policies and trajectory generation. Diffusion Policy learns time-indexed action distributions from demonstrations and shows strong manipulation performance [11]; in our experiments, we include a Diffusion Policy-style baseline adapted to our single-RGB-camera setting. Diffuser plans by denoising full trajectories, bridging generative modeling and decision-making [12]. Motion Planning Diffusion generates feasible motions with diffusion and compares against standard planning baselines on simulated manipulators [13]. There is also interest in model-based diffusion that leverages differentiable dynamics or costs for trajectory optimization [14]. In contrast, we do not replace the planner: we keep RRT/RRT-Connect and use diffusion only to sample extension segments under a mixture with uniform sampling, thereby preserving probabilistic completeness.
Many learning-based planners and policies capture geometry but underutilize task events such as grasp/release timing. Prior work commonly relied on goal poses, grasp detectors, or end-effector waypoints to guide planning or control [15,16,17], and task-and-motion planning frameworks explicitly model discrete gripper actions alongside continuous motion [18,19]. There is also existing work that focuses on contact-state estimation and force/impedance control for contact-rich skills [20,21]. However, learning a joint-space toggle configuration and using it as a conditioning signal to bias SBMP sampling has, to the best of our knowledge, received limited attention. We take this route and condition a diffusion-based sampler on a predicted toggle configuration (and subgoal), aligning the search with contact-relevant regions of the configuration space while retaining SBMP’s completeness properties through a mixture with uniform sampling.

3. Preliminaries

3.1. System Model and Notation

We consider a $d$-DOF manipulator with configuration (joint) space $\mathcal{Q} \subset \mathbb{R}^d$ and task space (workspace) $\mathcal{W} \subset \mathbb{R}^3 \times SO(3)$, where $SO(3)$ denotes the space of 3D rotations. We also use the generic SBMP notation $\mathcal{X}$ when referring to abstract state spaces.
  • $\mathcal{X} \subset \mathbb{R}^n$: generic state space (for SBMP discussion).
  • $\mathcal{X}_{\mathrm{obs}} \subset \mathcal{X}$, $\mathcal{X}_{\mathrm{free}} = \mathcal{X} \setminus \mathcal{X}_{\mathrm{obs}}$ (obstacle and collision-free subsets).
  • $\mathcal{Q}_{\mathrm{lim}} \subseteq \mathcal{Q}$: joint-limit-feasible set (per-joint bounds enforced).
  • $f_{\mathrm{FK}} : \mathcal{Q} \to \mathcal{W}$: forward kinematics to the end-effector pose (used only if task-space goals are considered).
  • $q_{\mathrm{start}} \in \mathcal{Q}_{\mathrm{lim}}$, $\mathcal{Q}_{\mathrm{term}} \subseteq \mathcal{Q}_{\mathrm{lim}}$ (start configuration and terminal goal region in joint space).
  • State and observations: the current joint configuration is $q_{\mathrm{curr}} \in \mathcal{Q}_{\mathrm{lim}}$ and the current gripper mode is $g_{\mathrm{curr}} \in \{\mathrm{open}, \mathrm{close}\}$. The observation consists of an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ and $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$.
  • Toggle and subgoal: the upper network predicts a toggle configuration $q_{\mathrm{toggle}} \in \mathcal{Q}_{\mathrm{lim}}$, a toggle mode $g_{\mathrm{toggle}} \in \{\mathrm{open}, \mathrm{close}\}$, and a joint-space subgoal $q_{\mathrm{goal}} \in \mathcal{Q}_{\mathrm{lim}}$ used as a phase-wise receding-horizon target between consecutive toggle events.
  • Distance and tolerance: unless stated otherwise, the tree metric is a (possibly weighted) Euclidean distance
    $$d(q, q') = \| W (q - q') \|_2,$$
    where $W \in \mathbb{R}^{d \times d}$ is a symmetric positive-definite (typically diagonal) weight matrix over joints.
For joint-limit projection, we write $\Pi_{\mathrm{lim}} : \mathcal{Q} \to \mathcal{Q}_{\mathrm{lim}}$, which clips/projects any proposed configuration to satisfy the limits.
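As a minimal illustration, the weighted tree metric and the projection $\Pi_{\mathrm{lim}}$ reduce to a few lines of NumPy when joint limits are per-joint box constraints; the helper names below are ours, not part of any planning library.

```python
import numpy as np

def weighted_distance(q1, q2, w=None):
    """Weighted Euclidean tree metric d(q, q') = ||W (q - q')||_2.
    `w` holds the diagonal of the SPD weight matrix W (per-joint weights);
    if omitted, the metric reduces to the plain Euclidean distance."""
    diff = np.asarray(q1, dtype=float) - np.asarray(q2, dtype=float)
    if w is not None:
        diff = np.asarray(w, dtype=float) * diff
    return float(np.linalg.norm(diff))

def project_to_limits(q, q_min, q_max):
    """Pi_lim: clip a proposed configuration onto the joint-limit set Q_lim."""
    return np.clip(q, q_min, q_max)
```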

3.2. Geometric Motion-Planning Problem

Let $\mathcal{Q} \subset \mathbb{R}^d$ be the configuration space, $\mathcal{Q}_{\mathrm{lim}} \subseteq \mathcal{Q}$ the joint-limit-feasible set, and let $\mathcal{Q}_{\mathrm{obs}} \subset \mathcal{Q}$ denote the set of collision configurations under the robot–environment model, with $\mathcal{Q}_{\mathrm{free}} = \mathcal{Q} \setminus \mathcal{Q}_{\mathrm{obs}}$. A path is a continuous map $s : [0, 1] \to \mathcal{Q}$ with $s(0) = q_{\mathrm{start}}$ and $s(1) \in \mathcal{Q}_{\mathrm{term}}$.
  • Collision-free: $s(\tau) \in \mathcal{Q}_{\mathrm{lim}} \cap \mathcal{Q}_{\mathrm{free}}$ for all $\tau \in [0, 1]$.
  • Feasible path: a collision-free path from $q_{\mathrm{start}}$ to $\mathcal{Q}_{\mathrm{term}}$.
We target single-query geometric planning. In experiments, the primary metric is the success rate; timing and expansion-count metrics are reported as secondary measures.

3.3. RRT Primitives

Let $\mathcal{T} = (V, E)$ be the search tree in $\mathcal{Q}$ and let $d$ denote the (possibly weighted) Euclidean metric used for nearest-neighbor queries. We also fix a goal tolerance $\tau_{\mathrm{goal}} > 0$ so that a tree node $v \in V$ is said to reach a target $q_{\mathrm{tgt}}$ if $d(v, q_{\mathrm{tgt}}) \le \tau_{\mathrm{goal}}$.
  • Nearest$(\mathcal{T}, q)$: returns $\arg\min_{v \in V} d(v, q)$.
  • Steer$(q, q_{\mathrm{target}}, \Delta)$: local extension from $q$ toward $q_{\mathrm{target}}$ with step size (or cap) $\Delta$.
  • CollisionFree$(q, q')$: true if the straight-line (or local steering) motion from $q$ to $q'$ lies in $\mathcal{Q}_{\mathrm{lim}} \cap \mathcal{Q}_{\mathrm{free}}$ (checked continuously).
  • Tree-extension segment: an ordered pair $(u, s)$ with unit direction $u \in \mathbb{R}^d$ ($\|u\| = 1$) and step length $s > 0$, producing the proposal
    $$q_{\mathrm{new}} = \Pi_{\mathrm{lim}}(q + s\,u),$$
    where $\Pi_{\mathrm{lim}}$ enforces joint limits prior to collision checking along the induced local motion.
This segment-based formulation matches the learned sampler introduced in Section 4.3.
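Concretely, the segment-to-proposal step can be sketched as follows under the same box-limit assumption (function name ours); the continuous collision check along the induced motion is performed separately by the planner.

```python
import numpy as np

def propose_from_segment(q_base, u, s, q_min, q_max):
    """Form q_new = Pi_lim(q_base + s * u) from a tree-extension segment (u, s).
    `u` is a unit direction in R^d and `s` > 0 a step length; the result is
    projected onto the joint limits before edge collision checking."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)            # re-normalize defensively
    q_new = np.asarray(q_base, dtype=float) + s * u
    return np.clip(q_new, q_min, q_max)  # joint-limit projection Pi_lim
```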

3.4. Sensing and Task Assumptions

We assume RGB scene observations and a binary gripper.
  • Observation: a scene image $I \in \mathbb{R}^{H \times W \times 3}$ and the current robot state $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$.
  • Gripper mode: $g \in \{\mathrm{open}, \mathrm{close}\}$.
  • Toggle state: a joint configuration at which the gripper should switch between modes for the task at hand (used as a contact-aware cue for sampling).

4. Proposed Method

4.1. Overview

We adopt the notation from Section 3. In particular, $\mathcal{Q}$ is the joint space, $\mathcal{Q}_{\mathrm{lim}} \subseteq \mathcal{Q}$ denotes the joint-limit-feasible set, and $q_{\mathrm{start}}$ and $\mathcal{Q}_{\mathrm{term}}$ denote the start configuration and the terminal goal region. We propose a two-level framework that couples learned contact semantics with a sampling-based search (Figure 1). The upper level is a ResNet-based network that takes a single RGB image together with the current joint configuration and gripper mode and predicts three quantities: a toggle configuration at which the gripper should switch, the corresponding toggle mode, and a joint-space subgoal used as a phase-wise receding-horizon target.
The lower planner treats the predicted subgoal as the current RRT target for the current phase and runs a diffusion-guided RRT/RRT-Connect from the current configuration toward this subgoal, drawing candidate segments from a mixture of the learned sampler and a uniform segment policy. Once a path is found and executed (with the gripper toggled near the predicted toggle configuration when appropriate), the resulting state becomes the start of the next planning phase, where the upper network is queried again. Repeating this procedure yields a phase-wise receding-horizon behavior that progresses through the sequence of contact events in the task.

4.2. Upper Network: Toggle–Subgoal Predictor

Architecture. Given $(I, q_{\mathrm{curr}}, g_{\mathrm{curr}})$, a ResNet encoder extracts a visual embedding from $I$ and fuses it with $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$ via lightweight MLP layers. Three prediction heads branch from shared features: (i) a regression head for the toggle configuration $q_{\mathrm{toggle}}$; (ii) a binary classification head for the toggle mode $g_{\mathrm{toggle}} \in \{\mathrm{open}, \mathrm{close}\}$; and (iii) a regression head for the joint-space subgoal $q_{\mathrm{goal}}$, which serves as a phase-wise receding-horizon target between consecutive toggle events. We apply standard feature normalization and dropout to the heads. Since all experiments are performed in simulation, only basic resizing/cropping is used for images.
Supervision and losses. From expert executions, we detect grasp/release events to label ( q toggle , g toggle ) and extract approach subgoals q goal along the corresponding segments (e.g., just before contact or aligned with task progression). In principle, q goal is chosen slightly before the toggle configuration along the approach trajectory; however, whenever this pre-contact waypoint is already within a small tolerance of q toggle (in joint space), we simply set q goal = q toggle for that phase.
The total loss is a weighted sum of two robust regression terms and one binary classification term, plus a mild regularizer enforcing joint-limit feasibility (and, optionally, proximity between the predicted toggle and subgoal):
$$\mathcal{L}_{\text{toggle-subgoal}} = w_q\,\mathrm{Huber}\big(q_{\mathrm{toggle}}^{\mathrm{pred}} - q_{\mathrm{toggle}}\big) + w_g\,\mathrm{CE}\big(\sigma(z_{\mathrm{toggle}}^{\mathrm{pred}}),\, g_{\mathrm{toggle}}\big) + w_s\,\mathrm{Huber}\big(q_{\mathrm{goal}}^{\mathrm{pred}} - q_{\mathrm{goal}}\big) + w_r\,\mathcal{R}. \quad (1)$$
Here, $\mathrm{Huber}(\cdot)$ is applied element-wise per joint and averaged; the binary cross-entropy
$$\mathrm{CE}(p, y) = -\big[\,y \log p + (1 - y) \log(1 - p)\,\big], \qquad y \in \{0, 1\},\quad p = \sigma(z) \in (0, 1),$$
uses the logit $z_{\mathrm{toggle}}^{\mathrm{pred}}$ and the sigmoid $\sigma(\cdot)$. The weights $w_q, w_g, w_s, w_r > 0$ balance the terms (typical ranges: $w_q, w_s \in [1, 10]$, $w_g \in [0.5, 2]$, $w_r \in [10^{-4}, 10^{-1}]$). Joint values are in radians (optionally normalized).
Regularization: We use
$$\mathcal{R} = \lambda_{\mathrm{lim}}\big[\mathcal{R}_{\mathrm{lim}}(q_{\mathrm{toggle}}^{\mathrm{pred}}) + \mathcal{R}_{\mathrm{lim}}(q_{\mathrm{goal}}^{\mathrm{pred}})\big] + \lambda_{\mathrm{prox}}\,\big\| q_{\mathrm{goal}}^{\mathrm{pred}} - q_{\mathrm{toggle}}^{\mathrm{pred}} \big\|_2^2,$$
with a soft hinge outside tightened limits:
$$\mathcal{R}_{\mathrm{lim}}(q) = \sum_{j=1}^{d} \max\big(0,\, q_j - (q_j^{\max} - \delta)\big)^2 + \max\big(0,\, (q_j^{\min} + \delta) - q_j\big)^2,$$
where $\delta > 0$ discourages sitting on the bounds, $\lambda_{\mathrm{lim}}$ controls the limit penalties, and $\lambda_{\mathrm{prox}}$ (kept relatively small) softly encourages the predicted subgoal $q_{\mathrm{goal}}^{\mathrm{pred}}$ not to deviate excessively from the predicted toggle $q_{\mathrm{toggle}}^{\mathrm{pred}}$ when appropriate. Standard weight decay is handled by the optimizer and excluded from $\mathcal{R}$. This term mainly affects phases where the demonstrated subgoal lies close to the toggle, acting as a weak consistency prior between the two heads.
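For concreteness, a PyTorch sketch of the total loss and regularizer above; the head output names, the dictionary layout, and the batch conventions are our assumptions, and the default coefficients simply echo the values used later in Section 5.1.

```python
import torch
import torch.nn.functional as F

def toggle_subgoal_loss(pred, target, q_min, q_max,
                        w_q=2.0, w_g=1.0, w_s=1.0, w_r=1e-4,
                        lam_lim=1.0, lam_prox=0.5, delta=0.05, kappa=2.0):
    """Weighted sum of two Huber regressions, one BCE term, and the soft
    joint-limit/proximity regularizer R. Predictions and targets are
    dictionaries of (B, d) joint tensors plus a (B,) toggle logit/label."""
    def soft_limit(q):  # squared hinge outside tightened limits
        upper = F.relu(q - (q_max - delta)) ** 2
        lower = F.relu((q_min + delta) - q) ** 2
        return (upper + lower).sum(dim=-1).mean()

    l_q = F.huber_loss(pred["q_toggle"], target["q_toggle"], delta=kappa)
    l_s = F.huber_loss(pred["q_goal"], target["q_goal"], delta=kappa)
    l_g = F.binary_cross_entropy_with_logits(pred["z_toggle"],
                                             target["g_toggle"].float())
    reg = lam_lim * (soft_limit(pred["q_toggle"]) + soft_limit(pred["q_goal"])) \
        + lam_prox * (pred["q_goal"] - pred["q_toggle"]).pow(2).sum(-1).mean()
    return w_q * l_q + w_g * l_g + w_s * l_s + w_r * reg
```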

4.3. Lower Planner: Diffusion-Guided Segment Sampling

Learning a sampling distribution (via diffusion). We operate on tree-extension segments rather than absolute configurations. Given a base node $q \in \mathcal{Q}$, a segment is an ordered pair $(u, s)$ with unit direction $u \in \mathbb{R}^d$ ($\|u\| = 1$) and step length $s > 0$, producing the proposal
$$q_{\mathrm{new}} = \Pi_{\mathrm{lim}}(q + s\,u), \quad (4)$$
followed by continuous collision checking along the local motion; proposals that collide or violate limits are rejected. The diffusion sampler $D_\theta$ generates $(u, s)$ conditioned on
$$\mathrm{cond} = (I, q_{\mathrm{curr}}, q_{\mathrm{goal}}, q_{\mathrm{toggle}}, g_{\mathrm{toggle}}),$$
where $q_{\mathrm{goal}}$ is the upper-level subgoal (the current RRT target within the phase), $q_{\mathrm{curr}}$ is the configuration at which the phase starts, and $I$ denotes the current RGB image for that phase. An optional lightweight tree summary $\mathrm{TreeContext}(\mathcal{T})$ (e.g., node count, recent expansion directions) can be included.
Diffusion objective and architecture. Let $x_0 \in \mathbb{R}^{d+1}$ be the clean segment vector obtained by concatenating the target direction and step, $x_0 := [u; s]$. We use the standard DDPM forward noising process with a variance schedule $\{\beta_t\}_{t=1}^{T}$ and $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{k=1}^{t} \alpha_k$:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad t \sim \mathcal{U}\{1, \dots, T\}.$$
Here, $x_t$ is the noised sample at step $t$. We train a noise-prediction model $\epsilon_\theta$ conditioned on $\mathrm{cond} = (I, q_{\mathrm{curr}}, q_{\mathrm{goal}}, q_{\mathrm{toggle}}, g_{\mathrm{toggle}})$:
$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta(x_t, \mathrm{cond}, t) \big\|_2^2\Big]. \quad (5)$$
In our implementation, the denoiser $\epsilon_\theta$ is instantiated as a lightweight 1D U-Net similar in spirit to Diffusion Policy [11], but operating on a single $(d+1)$-dimensional segment vector $(u, s)$ instead of a full action sequence. The global conditioning vector is constructed by concatenating (i) a ResNet-18–GN + spatial-softmax encoding of the single RGB image $I$, (ii) an MLP embedding of the current joint configuration and gripper mode $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$, and (iii) an MLP embedding of the predicted toggle and subgoal $(q_{\mathrm{toggle}}, q_{\mathrm{goal}}, g_{\mathrm{toggle}})$. This concatenated feature is projected to a 256-dimensional vector and injected into each residual block of the U-Net via FiLM-style affine modulation (scale–shift on activations). The diffusion timestep $t$ is embedded with sinusoidal position encodings and concatenated into the same conditioning pathway.
We train with the standard $\epsilon$-prediction objective in Equation (5) using a cosine noise schedule with $T = 100$ diffusion steps. At test time, we use a deterministic DDIM sampler with $T = 25$ steps ($T = 10$ in the ablations), obtaining $\hat{x}_0 = [\hat{u}; \hat{s}]$, normalizing $\hat{u}$ to unit length, clipping $\hat{s}$ to $[s_{\min}, s_{\max}]$, and finally forming a tree-extension segment via Equation (4).
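A hedged sketch of the training step and the test-time decoding; the denoiser call signature `eps_model(x_t, cond, t)` and the tensor shapes are our assumptions, while the step-length clip range matches Section 5.1.

```python
import torch

def ddpm_training_step(eps_model, x0, cond, alpha_bar):
    """One epsilon-prediction step of Eq. (5). `x0` is a (B, d+1) batch of
    clean segment vectors [u; s]; `alpha_bar` is the (T,) cumulative product
    of (1 - beta_t) from the cosine schedule."""
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps  # forward noising
    return ((eps - eps_model(x_t, cond, t)) ** 2).mean()

def decode_segment(x0_hat, s_min=0.05, s_max=0.2):
    """Split a denoised sample into a unit direction and a clipped step."""
    u, s = x0_hat[..., :-1], x0_hat[..., -1]
    u = u / u.norm(dim=-1, keepdim=True)
    return u, s.clamp(s_min, s_max)
```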
Base-node selection and uniform segments. We use a routine $\mathrm{SelectBase}(\mathcal{T}, \mathrm{target}, \mathrm{strategy})$ to choose the node $q_{\mathrm{base}} \in V$ from which an extension is attempted: nearest-to-target (for learned proposals), $q_{\mathrm{base}} = \arg\min_{v \in V} d(v, q_{\mathrm{goal}})$, which aligns growth toward the subgoal; and rrt-default (for uniform proposals), $q_{\mathrm{base}} = \mathrm{Nearest}(\mathcal{T}, q_{\mathrm{rand}})$, which recovers the classical RRT behavior. Optionally, ties can favor nodes nearer to $q_{\mathrm{toggle}}$ during grasp/release phases. The routine UniformSegment() samples a unit direction $u$ uniformly on the sphere and a positive step $s$ from a bounded distribution, then clips $s$ to $[s_{\min}, s_{\max}]$, as sketched below.
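A minimal version of UniformSegment(): the direction is uniform on the unit sphere (a normalized Gaussian draw), and the step is drawn uniformly from the clip range; the uniform step law is our choice of a simple bounded distribution.

```python
import numpy as np

def uniform_segment(d, s_min=0.05, s_max=0.2, rng=None):
    """Sample a tree-extension segment (u, s) with u uniform on S^{d-1}."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)         # normalized Gaussian => uniform direction
    s = rng.uniform(s_min, s_max)  # bounded positive step length
    return u, s
```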
Algorithm and integration with RRT/RRT-Connect (phase-wise planning and execution). Algorithm 1 summarizes this phase-wise planning–execution loop. We fix a goal tolerance $\tau_{\mathrm{goal}} > 0$ so that a tree node $v$ is said to reach $q_{\mathrm{goal}}$ if $d(v, q_{\mathrm{goal}}) \le \tau_{\mathrm{goal}}$. At the beginning of each short planning phase, we acquire the current RGB image and query the upper network $F_\psi$ on the current observation to predict $(q_{\mathrm{toggle}}, g_{\mathrm{toggle}}, q_{\mathrm{goal}})$. Within that phase, at each tree expansion, we draw a candidate segment either from the conditional diffusion sampler or, with nonzero probability $p_{\mathrm{uni}} > 0$, from UniformSegment(). Once a node sufficiently close to $q_{\mathrm{goal}}$ is found (or the iteration limit is reached), we stop planning, extract and smooth a path segment, and execute it on the robot; the gripper is toggled to $g_{\mathrm{toggle}}$ when the trajectory passes near $q_{\mathrm{toggle}}$. The resulting state $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$ becomes the start of the next planning phase, and the procedure is repeated with a fresh image and a new query to $F_\psi$.
In practice, several such planning phases may be chained on the way to the same physical grasp or release event, so multiple successive subgoals can be predicted before the corresponding toggle configuration is actually reached. Nearest-neighbor search, steering, and collision checking remain unchanged, and the same interface applies to RRT-Connect by expanding two trees symmetrically and attempting to connect them when appropriate.
In our implementation, the upper network F ψ is queried once per planning–execution phase (one iteration of the outer while-loop in Algorithm 1). At the beginning of each phase, we render the current RGB image and form the observation ( I curr , q curr , g curr ) , predict ( q toggle , g toggle , q goal ) , and then keep these predictions fixed while the RRT/RRT-Connect planner runs for up to N rrt expansions toward q goal . After a path segment is found and executed, the resulting state ( q curr , g curr ) becomes the start of the next phase, at which point we render a new image and query F ψ again. Thus, the re-query frequency of the upper network is tied to how often we re-plan (per phase), rather than to the number of individual RRT expansions. A single physical toggle event (e.g., the first grasp) may be reached over several successive phases as the predicted subgoals progressively move the robot closer to the toggle configuration.
Completeness under mixture. Let $\lambda \in [0, 1)$ denote the fraction of learned proposals and $(1 - \lambda)$ the fraction of uniform proposals selected at each expansion (equivalently, $p_{\mathrm{uni}} = 1 - \lambda$ in Algorithm 1). Since $(1 - \lambda) > 0$, every open subset of the free configuration space retains nonzero sampling probability from the uniform component. Therefore, the planner inherits the probabilistic completeness behavior of RRT: as the total number of proposals $N \to \infty$, the probability of finding a feasible solution (if one exists) converges to one. This follows the same argument used for learned sampling mixtures in [4], where the theoretical guarantees hold after replacing the number of samples $N$ with the number of uniform samples $(1 - \lambda)N$. Intuitively, adding learned samples can only accelerate the search, while the uniform sub-sampling preserves coverage of $\mathcal{Q}_{\mathrm{free}}$.
Path post-processing (smoothing and parameterization): After a successful search, we apply two-stage post-processing: (i) shortcut smoothing, which attempts to establish direct connections between random waypoint pairs with continuous collision checks and limit projection, and (ii) a continuous fit using cubic B-splines (or piecewise quintics) followed by time-parameterization under joint velocity/acceleration limits. The resulting trajectory remains in $\mathcal{Q}_{\mathrm{lim}}$ and is validated with dense collision checks before execution, reducing path length and curvature and improving tracking robustness without altering the planner interface. A sketch of the shortcut stage follows.
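A sketch of the first stage (shortcut smoothing), assuming the planner's continuous edge checker is available as a callable; spline fitting and time-parameterization are omitted here.

```python
import numpy as np

def shortcut_smooth(path, collision_free, n_attempts=100, rng=None):
    """Repeatedly pick two random waypoints and splice in the straight
    shortcut whenever `collision_free(q_a, q_b)` validates it; the checker
    is the planner's existing continuous edge test."""
    rng = rng or np.random.default_rng()
    path = list(path)
    for _ in range(n_attempts):
        if len(path) < 3:
            break
        i, j = sorted(rng.choice(len(path), size=2, replace=False))
        if j - i < 2:
            continue                        # adjacent waypoints: nothing to cut
        if collision_free(path[i], path[j]):
            path = path[:i + 1] + path[j:]  # drop the intermediate waypoints
    return path
```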
Algorithm 1 Toggle–subgoal conditioned diffusion-guided RRT (phase-wise planning and execution)
Require: Access to the current RGB image $I_{\mathrm{curr}}$ (camera/rendering); initial robot state $(q_{\mathrm{start}}, g_{\mathrm{start}})$; terminal region $\mathcal{Q}_{\mathrm{term}}$; RRT iteration limit $N_{\mathrm{rrt}}$; uniform-proposal probability $p_{\mathrm{uni}} > 0$; goal tolerance $\tau_{\mathrm{goal}}$; upper network $F_\psi$; diffusion sampler $D_\theta$
  1:   $(q_{\mathrm{curr}}, g_{\mathrm{curr}}) \leftarrow (q_{\mathrm{start}}, g_{\mathrm{start}})$
  2:  while $q_{\mathrm{curr}} \notin \mathcal{Q}_{\mathrm{term}}$ do
  3:        Obtain the current RGB image $I_{\mathrm{curr}}$
  4:        $(q_{\mathrm{toggle}}, g_{\mathrm{toggle}}, q_{\mathrm{goal}}) \leftarrow F_\psi(I_{\mathrm{curr}}, q_{\mathrm{curr}}, g_{\mathrm{curr}})$ ▹ predict the next toggle and subgoal for the current phase
  5:        Initialize the tree $\mathcal{T} = (V, E)$ with root $q_{\mathrm{curr}}$: $V \leftarrow \{q_{\mathrm{curr}}\}$, $E \leftarrow \emptyset$
  6:        $q_{\mathrm{reach}} \leftarrow q_{\mathrm{curr}}$ ▹ best node reached so far toward $q_{\mathrm{goal}}$
  7:        for $k = 1$ to $N_{\mathrm{rrt}}$ do
  8:              use_learned $\leftarrow$ (Uniform$(0, 1) > p_{\mathrm{uni}}$)
  9:              if use_learned then
10:                   $(u, s) \leftarrow D_\theta(I_{\mathrm{curr}}, q_{\mathrm{curr}}, q_{\mathrm{goal}}, q_{\mathrm{toggle}}, g_{\mathrm{toggle}}, \mathrm{TreeContext}(\mathcal{T}))$ ▹ $q_{\mathrm{curr}}$ is the phase start state
11:                   $q_{\mathrm{base}} \leftarrow \mathrm{SelectBase}(\mathcal{T}, \mathrm{target} = q_{\mathrm{goal}}, \mathrm{strategy} =$ nearest-to-target$)$
12:              else
13:                   $(u, s) \leftarrow$ UniformSegment()
14:                   $q_{\mathrm{base}} \leftarrow \mathrm{SelectBase}(\mathcal{T},$ none$, \mathrm{strategy} =$ rrt-default$)$
15:              end if
16:              $q_{\mathrm{new}} \leftarrow \Pi_{\mathrm{lim}}(q_{\mathrm{base}} + s\,u)$
17:              if CollisionFree$(q_{\mathrm{base}}, q_{\mathrm{new}})$ then
18:                   $V \leftarrow V \cup \{q_{\mathrm{new}}\}$;    $E \leftarrow E \cup \{(q_{\mathrm{base}}, q_{\mathrm{new}})\}$
19:                  if $d(q_{\mathrm{new}}, q_{\mathrm{goal}}) < d(q_{\mathrm{reach}}, q_{\mathrm{goal}})$ then
20:                        $q_{\mathrm{reach}} \leftarrow q_{\mathrm{new}}$
21:                  end if
22:              end if
23:              if $d(q_{\mathrm{reach}}, q_{\mathrm{goal}}) \le \tau_{\mathrm{goal}}$ then ▹ phase goal reached
24:                  break
25:              end if
26:        end for
27:        $\mathrm{path}_{\mathrm{phase}} \leftarrow$ ExtractPath$(\mathcal{T}, q_{\mathrm{reach}})$
28:        $\mathrm{path}_{\mathrm{phase}} \leftarrow$ Smooth$(\mathrm{path}_{\mathrm{phase}})$
29:        Execute $\mathrm{path}_{\mathrm{phase}}$ on the robot/simulator, toggling the gripper to $g_{\mathrm{toggle}}$ when $q \approx q_{\mathrm{toggle}}$
30:        Update $(q_{\mathrm{curr}}, g_{\mathrm{curr}})$ from the final state of the executed trajectory
31:  end while
32:  return the concatenation of all executed phase trajectories

4.4. Inference Procedure

Given access to an RGB camera that provides the current scene image and an initial robot state ( q start , g start ) , inference proceeds in a phase-wise manner:
  • Initialize the current state as ( q curr , g curr ) ( q start , g start ) .
  • While q curr has not reached the terminal region Q term :
    (a)
    Upper-level prediction (current phase). Acquire the current RGB image $I_{\mathrm{curr}}$ and query the upper network on the current observation to obtain
    $$(q_{\mathrm{toggle}}, g_{\mathrm{toggle}}, q_{\mathrm{goal}}) = F_\psi(I_{\mathrm{curr}}, q_{\mathrm{curr}}, g_{\mathrm{curr}}),$$
    where $q_{\mathrm{goal}}$ is a joint-space subgoal (typically near the next toggle) that serves as the target for the upcoming phase.
    (b)
    Diffusion-guided RRT/RRT-Connect. Initialize an RRT/RRT-Connect tree $\mathcal{T}$ with root $q_{\mathrm{curr}}$ and grow it toward $q_{\mathrm{goal}}$ for up to $N_{\mathrm{rrt}}$ expansions. At each expansion, obtain a learned segment $(u, s)$ from the diffusion sampler $D_\theta$ conditioned on $(I_{\mathrm{curr}}, q_{\mathrm{curr}}, q_{\mathrm{goal}}, q_{\mathrm{toggle}}, g_{\mathrm{toggle}})$ (and an optional tree summary), and with probability $p_{\mathrm{uni}}$ replace it with a uniform segment from UniformSegment(). Each proposal is converted to $q_{\mathrm{new}}$ via Equation (4) and accepted only if CollisionFree holds. The phase terminates when a node $v$ in the tree satisfies $d(v, q_{\mathrm{goal}}) \le \tau_{\mathrm{goal}}$ or the iteration limit $N_{\mathrm{rrt}}$ is reached.
    (c)
    Path extraction and execution. Extract a path from q curr to the best-reached node near q goal , apply shortcut smoothing and spline-based time-parameterization under joint limits, and validate the resulting trajectory with dense collision checks. Execute this trajectory on the robot (or in simulation), toggling the gripper to g toggle when the trajectory passes near q toggle . The final state of the executed trajectory becomes the new ( q curr , g curr ) for the next phase.
  • The overall motion is obtained by concatenating the phase trajectories until q curr Q term or a global time/iteration limit is exceeded.

5. Experimental Results

Our evaluations are based entirely on simulations of a single contact-rich benchmark—uprighting and placing a mug—chosen to stress correct grasp/release timing. Our primary outcome is the success rate. In addition, we report (i) per-proposal inference time for the learned samplers, (ii) episode-level time-to-first-solution (TTFS), defined as the total wall-clock planning time accumulated over all short planning phases in an episode from the initial state until a trajectory reaching the terminal region $\mathcal{Q}_{\mathrm{term}}$ is found (planner overhead included), and (iii) an efficiency comparison within RRT-style planners using the number of RRT expansions (“RRT steps”) normalized to our method. Unless noted otherwise, all search-based methods use identical random seeds, iteration limits, nearest-neighbor metrics, steering procedures, collision checking, and post-processing (shortcut smoothing plus spline time-parameterization).

5.1. Simulation Setup and Data Collection

Environment. All experiments are conducted in PyBullet (v3.2.7) using a 7-DOF Franka Emika Panda arm equipped with a parallel-jaw gripper. We implement all deep learning components in PyTorch (v2.9.1). RGB images are rendered from a single fixed eye-in-hand–style camera and resized to 88 × 136 pixels.
Task. We consider a mug manipulation task with two contact events: (1) From a side-lying pose, grasp near the rim (grasp1), upright and place (release1); (2) Re-approach, grasp the handle (grasp2), transport, and place the mug inside a target box (release2). This task stresses correct contact reasoning; grasping inappropriate regions typically yields slip or collision.
We deliberately design the benchmark to involve two distinct contact modes: an initial rim grasp used to establish the upright position of the mug and stabilize it, followed by a handle grasp for transport and final placement. The second stage is only attempted if the first rim grasp and uprighting succeed, making the task a sequential, contact-rich benchmark rather than a single pick-and-place.
Demonstrations and labels. Expert rollouts are generated with bidirectional RRT (biRRT) plus scripted grasp heuristics in joint space. From each rollout, we record $(I_{0:T}, q_{0:T}, g_{0:T})$, annotate the four toggle events (grasp1, release1, grasp2, release2), and extract the approach subgoals used as intermediate waypoints. A total of 2000 rollouts are used for training; all data are collected via simulation.
Subgoals q goal are chosen as pre-contact approach waypoints along the expert trajectory. If such a waypoint is already very close (within a small joint-space tolerance) to q toggle , we simply set q goal = q toggle for that phase.
Network I/O parameterization. Joint angles (in radians) are normalized to $[-1, 1]$. When an orientation target is needed, we adopt the continuous 6D rotation representation of Zhou et al. [22] to avoid discontinuities (joint variables remain scalars; the 6D representation is used only for $SO(3)$ targets/features).
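For reference, mapping the 6D representation back to a rotation matrix amounts to a Gram–Schmidt step over the two predicted 3-vectors, as in Zhou et al. [22]; the function name and the row-stacking convention below are ours.

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(x6):
    """Recover a rotation matrix from a (..., 6) tensor holding two
    3-vectors: orthonormalize them and complete with a cross product."""
    a1, a2 = x6[..., :3], x6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)  # stacked basis vectors
```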
Training/inference hyperparameters. The upper network (ResNet backbone + MLP heads) is trained with Equation (1) using loss weights $w_q = 2.0$, $w_g = 1.0$, $w_s = 1.0$, $w_r = 10^{-4}$, a Huber threshold $\kappa = 2.0$, and regularization coefficients $\lambda_{\mathrm{lim}} = 1.0$ and $\lambda_{\mathrm{prox}} = 0.5$. The lower sampler is a conditional diffusion model (1D U-Net) trained with Equation (5). For diffusion, training uses $T = 100$ noise steps with a cosine schedule; evaluation uses a deterministic DDIM sampler [23] with $T = 25$ steps ($T = 10$ in the ablations). Unless otherwise noted, the RRT mixture uses $p_{\mathrm{uni}} = 0.2$ (uniform-segment probability), and the step length is clipped to $s \in [0.05, 0.2]$ before forming proposals via Equation (4).

5.2. Qualitative Overview of the Proposed Method

Figure 2 shows two major phases (grasp and release) and illustrates how the learned sample endpoints (translucent blue dots) concentrate along task-relevant approach corridors while the tree grows toward the predicted subgoal. Building on this, Figure 3 walks through a single episode from the first grasp to final placement, showing how the phase-wise receding-horizon updates keep q goal pred close to the currently relevant toggle for that phase. Finally, Figure 4 demonstrates consistency across randomized scenes: in three different initializations, the learned proposal distribution remains phase-appropriate (rim → release → handle → terminal region) while adapting to pose variations.

5.3. Evaluation Protocol and Metrics

Each episode randomizes the mug SE(2) pose on the table (with small roll/pitch); the arm starts from a fixed home pose with an open gripper. Unless otherwise noted, each comparison is evaluated over 300 randomized episodes. We consider two complementary comparisons, each with its own metric and baselines:
  • Comparison 1 (Primary): Success rate. End-to-end imitation policies (no explicit search or collision checking) versus our search-based planner.
  • Comparison 2 (Efficiency within the RRT family): normalized RRT expansions. RRT-style planners that share the same collision checker are compared by the number of expansions relative to ours, along with associated runtime metrics (per-proposal time and episode-level TTFS).
Unless otherwise noted, all RRT-family planners (ours, uniform RRT, goal-biased RRT, CVAE, conditional GAN) use the same per-phase iteration limit N rrt . This budget was chosen via preliminary sweeps to be large enough that success rates have essentially saturated on our mug benchmark; increasing N rrt further mainly increases computation while providing diminishing returns in success. As in standard sampling-based motion planning, N rrt therefore controls the usual trade-off between planning reliability and computational cost, but our comparative conclusions are drawn under a fixed, shared N rrt across all methods.

5.4. Comparison 1: Success Rate vs. Imitation Policies

We compare our search-based planner to three imitation (no-search) policies evaluated only by success rate: (i) a Diffusion Policy baseline [11] adapted to our setting (inputs $(I, q_{\mathrm{curr}}, g_{\mathrm{curr}})$), (ii) an LSTM–GMM policy following Mandlekar et al. [24], and (iii) a Transformer-based policy [25]. All imitation policies take the same inputs and predict a short 16-step action rollout. At test time, we execute the first eight predicted steps and then re-query the policy in a receding-window fashion until termination, as sketched below; this preserves closed-loop operation and mitigates long-horizon open-loop drift across policies. Because these methods do not expose RRT expansions or explicit collision checking, success rate is the only fair metric here.
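The receding-window protocol can be sketched as follows; the `policy` and `env` interfaces are hypothetical stand-ins for our benchmark harness, not a real API.

```python
def receding_window_rollout(policy, env, horizon=16, exec_steps=8, max_queries=60):
    """Predict a `horizon`-step action rollout, execute the first
    `exec_steps` actions, then re-query the policy on the new observation."""
    obs = env.reset()
    for _ in range(max_queries):
        actions = policy(obs)             # assumed shape: (horizon, action_dim)
        for action in actions[:exec_steps]:
            obs, done = env.step(action)  # assumed (obs, done) step interface
            if done:
                return obs                # episode terminated (success or abort)
    return obs
```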
Our implementation of Diffusion Policy adheres to the public codebase as closely as possible (network architecture, loss, and training procedure), but uses a single RGB camera matching our simulation setup rather than the multi-camera configuration in the original work. Consequently, the absolute performance numbers are not directly comparable to those reported in [11], but this baseline still represents a strong, modern diffusion-based imitation policy for our mug manipulation benchmark. The LSTM–GMM and Transformer baselines are re-tuned for our task to avoid strawman comparisons.
Table 1 reports results over 300 episodes. Our method achieves the highest success rate by combining (a) a learned, contact-aware sampling distribution with (b) explicit collision checking inside RRT. Consequently, unless the upper network provides severely misleading guidance, the planner remains robust to moderate perception errors and local misproposals. Among the imitation baselines, the Diffusion Policy is competitive but still trails our planner; the LSTM–GMM and Transformer policies underperform due to error accumulation across phases that require sequential contact events (rim grasp → upright release → handle grasp → final placement). In such settings, pure imitation struggles to recover from compounding deviations, whereas our planner’s search component provides corrective exploration and safety via collision checks.

5.5. Comparison 2: Normalized RRT Expansions Within the RRT Family

We now restrict attention to planners that share the same SBMP backbone and collision checker and report normalized RRT expansions (number of tree expansions) relative to ours (to factor out execution differences, normalization is computed on successful episodes only). In all comparisons, we replace only the lower planner while keeping the upper network fixed. Baselines include (i) uniform RRT; (ii) goal-biased RRT that samples the subgoal $q_{\mathrm{goal}}$ with probability 0.5; (iii) a CVAE-based sampler [9] following Ichter et al. [4]; and (iv) a conditional GAN sampler [10]. All methods use identical nearest-neighbor metrics, steering, collision checking, iteration limits, goal tolerance $\tau_{\mathrm{goal}}$, and the same smoothing/time-parameterization pipeline. Results (500 trials; averages taken over successful episodes) are reported in Table 2.
The learned samplers (CVAE, conditional GAN, and ours) receive the same conditioning signals and mixture probability, cond = ( I , q curr , q goal , q toggle , g toggle ) with identical p uni . Hence, differences arise due to the quality of the learned proposal distribution, rather than due to differing inputs or mixture weights.
Because diffusion denoising requires multiple steps, the learned proposal generators have different per-proposal runtimes across model families. We therefore report the average inference time per proposal and a normalized time with DDIM ( T = 25 steps) set to 1.00 . Timings are measured with batch size 1 using PyTorch in FP32 and exclude collision checking, nearest-neighbor, and other planner overheads. The results are averaged over 500 proposal evaluations per method under identical hardware/software settings. As shown in Table 3, reducing DDIM steps (e.g., from T = 25 to T = 10 ) provides an approximately linear speedup, while feed-forward generators (CVAE/conditional GAN) are substantially faster per proposal. In practice, for deployment scenarios where per-proposal latency is critical, a shorter schedule such as DDIM with T = 10 can be a reasonable trade-off.
To assess end-to-end planning latency, we report an episode-level time-to-first-solution (TTFS), defined as the total wall-clock planning time accumulated over all planning phases in an episode until a feasible trajectory that reaches the terminal region Q term is found (Table 4). TTFS includes planner overhead (nearest-neighbor search, collision checking, etc.) and is averaged over 500 successful episodes. For each p uni setting, values are normalized independently, with DDIM ( T = 25 ) taken as the 1.00 baseline.
While diffusion-based sampling improves search efficiency in terms of normalized RRT expansions, it incurs a higher absolute per-proposal cost due to its multi-step denoising process. Consequently, the end-to-end, episode-level time to first solution (TTFS) may be worse than that of feed-forward samplers under identical settings. Optimizing wall-clock latency is outside the main scope of this work; however, we observe that simple schedule reductions (e.g., DDIM with fewer steps, such as T = 10 ) yield an approximately linear speedup with only modest degradation in planning performance. This provides a practical tuning parameter when per-proposal latency is critical.
All of the experiments in this work are conducted via simulation, and we do not claim hard real-time guarantees on any specific robotic platform. As shown in Table 3 and Table 4, the diffusion-based sampler incurs a higher per-proposal cost than feed-forward samplers, so the method is not intended for high-frequency inner-loop control (e.g., 10–100 Hz torque or velocity control). Instead, our framework targets task-level or episodic planning, where a motion plan for a multi-second manipulation is computed and then executed by a lower-level controller. Once a trajectory is generated, execution only requires standard interpolation and tracking and does not involve running the diffusion model in the control loop. In this regime, planning latency on the order of seconds can be acceptable for static or slowly changing environments. A detailed, platform-specific real-time characterization on physical robot hardware is left for future work.

5.6. Qualitative Comparison of Learned Sample Distributions

Figure 5 contrasts the learned proposal distributions near contact-heavy scenes. In all panels, blue dots show only the endpoints proposed by each model’s learned sampler (candidate tree–extension targets; uniform samples are omitted). Our diffusion sampler concentrates mass along approach corridors to the rim and handle while preserving useful diversity, whereas CVAE/CGAN often allocate probability to task-irrelevant regions (marked with green dashed example arcs).

5.7. Robustness to Viewpoint and Illumination

While our main study is conducted via simulation, we assess the sensitivity of the toggle–subgoal predictor to camera pose and photometric shifts by perturbing the rendered image with (i) viewpoint jitter (yaw/pitch rotations and small per-axis translations) and (ii) photometric jitter (Table 5). For the latter, additive Gaussian noise with standard deviation σ = 0.02 is applied to images whose pixel values are scaled to [ 0 , 1 ] , followed by clamping back to [ 0 , 1 ] .
We perturb the camera extrinsics with independent, zero-mean, per-axis jitter: yaw and pitch are sampled uniformly from $[-\alpha, \alpha]$ degrees (with roll fixed to 0 for all tests), and translations along the camera $(x, y, z)$ axes are sampled uniformly from $[-\tau, \tau]$ meters. Unless otherwise noted, we set $(\alpha, \tau) = (5, 0.01)$ for the mild setting and $(10, 0.02)$ for the stronger setting. After updating the extrinsics, we re-render the RGB image and feed it to the upper network. The planner keeps the same mixture setting ($p_{\mathrm{uni}} = 0.2$) and re-queries the upper network in a receding-horizon manner.
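Both perturbations are simple to reproduce; a sketch under the stated settings (helper names ours):

```python
import numpy as np

def photometric_jitter(img, sigma=0.02, rng=None):
    """Additive Gaussian noise on [0, 1]-scaled pixels, then clamp."""
    rng = rng or np.random.default_rng()
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def viewpoint_jitter(alpha_deg=5.0, tau_m=0.01, rng=None):
    """Per-axis camera jitter: yaw/pitch uniform in [-alpha, alpha] degrees
    (roll fixed to 0) and translations uniform in [-tau, tau] meters."""
    rng = rng or np.random.default_rng()
    yaw, pitch = rng.uniform(-alpha_deg, alpha_deg, size=2)
    t_xyz = rng.uniform(-tau_m, tau_m, size=3)
    return yaw, pitch, t_xyz
```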
Because the method estimates the toggle configuration from camera input, severe appearance changes can degrade the upper network’s predictions. In practice, periodic re-querying (receding horizon) together with a nonzero fraction of uniform samples helps us to recover from transient mispredictions by recentering the search after a few inner iterations.
We report both the success rate and the mean joint-space toggle error. Here, Toggle err denotes the mean joint-space distance (in radians) between the predicted toggle configuration q toggle pred and the nearest ground-truth toggle configuration (either grasp or release) for that episode.
Overall, the nominal setting achieves 96% success in this 50-episode study; success degrades to 88% under mild viewpoint jitter and 82% under the stronger jitter, with photometric noise alone yielding 90% success. These results indicate a graceful degradation of toggle–subgoal predictions under moderate viewpoint and illumination shifts, rather than catastrophic failure.

5.8. Limitations and Practical Considerations

Our evaluation is simulation-based. Deploying the method on real hardware also requires (i) calibrated sensing (e.g., camera/hand–eye), (ii) perception robustness to viewpoint and illumination changes, (iii) verified robot–scene geometry with enforced joint/velocity limits for collision checking, and (iv) profiling of the full planning loop to satisfy cycle-time and safety constraints. While our sampler is designed to be interface-compatible with the RRT/RRT-Connect sampling module, a thorough hardware study—covering calibration, safety checks, and latency budgeting (e.g., shorter DDIM schedules)—will be conducted in future work.

6. Conclusions

We presented a two-level, contact-aware sampling-based motion-planning framework for robotic manipulation. An upper network predicts a gripper toggle configuration, its mode (open/close), and a joint-space subgoal from a single RGB image and the current robot state. A lower planner then runs RRT/RRT-Connect with a conditional diffusion sampler that proposes tree-extension segments aligned with these contact cues, while mixing in a nonzero fraction of uniform samples to retain probabilistic completeness. This receding-horizon coupling injects grasp/release semantics into sampling without modifying standard nearest-neighbor, steering, or collision-checking primitives.
On a contact-rich mug benchmark, our method achieves higher success rates than imitation-learning baselines and requires fewer RRT expansions than classical and prior learning-guided samplers by concentrating proposals along contact-relevant approach manifolds. The resulting efficiency gains come at the cost of higher per-proposal compute due to diffusion denoising, but simple schedule reductions provide a practical latency–performance trade-off at test time.
Our study is limited to simulations, a single robot, and one manipulation family, and we do not yet provide hardware-level real-time guarantees. A real-robot evaluation and a more detailed, platform-specific latency characterization are therefore important directions for future work. Additional extensions include a transfer to hardware via domain randomization, scaling to multi-task and multi-object settings, incorporating uncertainty estimates to adapt the uniform mixture rate online, leveraging richer perceptual inputs (e.g., depth, multi-view, affordance cues), and extending the framework to asymptotic-optimal variants (e.g., RRT*) and dynamic scenes.

Author Contributions

Conceptualization, K.L. and K.C.; methodology, K.L. and K.C.; validation, K.L.; data curation, K.L. and K.C.; writing—original draft preparation, K.L.; writing—review and editing, K.C.; visualization, K.L.; supervision, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2023-00259678).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Part of the dataset is available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. LaValle, S.M.; Kuffner, J.J., Jr. Randomized kinodynamic planning. Int. J. Robot. Res. 2001, 20, 378–400. [Google Scholar] [CrossRef]
  2. Kuffner, J.J.; LaValle, S.M. RRT-connect: An efficient approach to single-query path planning. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation (ICRA), San Francisco, CA, USA, 24–28 April 2000; Volume 2, pp. 995–1001. [Google Scholar]
  3. Karaman, S.; Frazzoli, E. Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 2011, 30, 846–894. [Google Scholar] [CrossRef]
  4. Ichter, B.; Harrison, J.; Pavone, M. Learning sampling distributions for robot motion planning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 7087–7094. [Google Scholar]
  5. Qureshi, A.H.; Simeonov, A.; Bency, M.J.; Yip, M.C. Motion planning networks. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2118–2124. [Google Scholar]
  6. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  7. Gammell, J.D.; Srinivasa, S.S.; Barfoot, T.D. Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 2997–3004. [Google Scholar]
  8. Huang, Z.; Chen, H.; Pohovey, J.; Driggs-Campbell, K. Neural Informed RRT*: Learning-based path planning with point cloud state representations under admissible ellipsoidal constraints. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 8742–8748. [Google Scholar]
  9. Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  10. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  11. Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robot. Res. 2025, 44, 1684–1704. [Google Scholar] [CrossRef]
  12. Janner, M.; Du, Y.; Tenenbaum, J.; Levine, S. Planning with Diffusion for Flexible Behavior Synthesis. In Proceedings of the International Conference on Machine Learning, Baltimore, MA, USA, 17–23 July 2022. [Google Scholar]
  13. Carvalho, J.; Le, A.T.; Baierl, M.; Koert, D.; Peters, J. Motion planning diffusion: Learning and planning of robot motions with diffusion models. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 1916–1923. [Google Scholar]
  14. Pan, C.; Yi, Z.; Shi, G.; Qu, G. Model-based diffusion for trajectory optimization. Adv. Neural Inf. Process. Syst. 2024, 37, 57914–57943. [Google Scholar]
  15. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics. In Proceedings of the Robotics: Science and Systems (RSS), Cambridge, MA, USA, 12–14 July 2017. [Google Scholar]
  16. Ten Pas, A.; Gualtieri, M.; Saenko, K.; Platt, R. Grasp pose detection in point clouds. Int. J. Robot. Res. 2017, 36, 1455–1473. [Google Scholar] [CrossRef]
  17. Sundermeyer, M.; Mousavian, A.; Triebel, R.; Fox, D. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13438–13444. [Google Scholar]
  18. Kaelbling, L.P.; Lozano-Pérez, T. Hierarchical task and motion planning in the now. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1470–1477. [Google Scholar]
  19. Garrett, C.R.; Lozano-Pérez, T.; Kaelbling, L.P. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In Proceedings of the International Conference on Automated Planning and Scheduling, Nancy, France, 14–19 June 2020; Volume 30, pp. 440–448. [Google Scholar]
  20. Wirnshofer, F.; Schmitt, P.S.; Meister, P.; Wichert, G.v.; Burgard, W. State estimation in contact-rich manipulation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3790–3796. [Google Scholar]
  21. Beltran-Hernandez, C.C.; Petit, D.; Ramirez-Alpizar, I.G.; Nishi, T.; Kikuchi, S.; Matsubara, T.; Harada, K. Learning force control for contact-rich manipulation tasks with rigid position-controlled robots. IEEE Robot. Autom. Lett. 2020, 5, 5709–5716. [Google Scholar] [CrossRef]
  22. Zhou, Y.; Barnes, C.; Lu, J.; Yang, J.; Li, H. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5745–5753. [Google Scholar]
  23. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  24. Mandlekar, A.; Xu, D.; Wong, J.; Nasiriany, S.; Wang, C.; Kulkarni, R.; Li, F.; Savarese, S.; Zhu, Y.; Martín-Martín, R. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL) PMLR, London, UK, 8–11 November 2021. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Figure 1. Framework overview. The upper network (ResNet + heads) takes an RGB image and the current robot state $(I, q_{\mathrm{curr}}, g_{\mathrm{curr}})$ and predicts a toggle configuration $q_{\mathrm{toggle}}$, toggle mode $g_{\mathrm{toggle}}$, and a joint-space subgoal $q_{\mathrm{goal}}$. The lower planner runs RRT/RRT-Connect in joint space; its sampling step is replaced and augmented by a conditional diffusion sampler that denoises a latent $x_T \to x_0$ under the conditioning $(I, q_{\mathrm{curr}}, q_{\mathrm{goal}}, q_{\mathrm{toggle}}, g_{\mathrm{toggle}})$ and decodes it to an extension segment $(u, s)$ (with $u$ normalized and $s$ clipped). A nonzero fraction of uniform segments is mixed in to retain probabilistic completeness. In each short planning phase, the planner grows a tree toward $q_{\mathrm{goal}}$, extracts and smooths a path segment, executes it on the robot (toggling the gripper near $q_{\mathrm{toggle}}$ when appropriate), and then re-queries the upper network at the new state, yielding a phase-wise receding-horizon behavior along the sequence of contact events.
Figure 1. Framework overview. The upper network (ResNet + heads) takes an RGB image and the current robot state ( I , q curr , g curr ) and predicts a toggle configuration q toggle , toggle mode g toggle , and a joint-space subgoal q goal . The lower planner runs RRT/RRT-Connect in joint space; its sampling step is replaced and augmented by a conditional diffusion sampler that denoises a latent x T x 0 under conditions ( I , q curr , q goal , q toggle , g toggle ) and decodes to an extension segment ( u , s ) (with u normalized and s clipped). A nonzero fraction of uniform segments is mixed in to retain probabilistic completeness. In each short planning phase, the planner grows a tree toward q goal , extracts and smooths a path segment, executes it on the robot (toggling the gripper near q toggle when appropriate), and then re-queries the upper network at the new state, yielding a phase-wise receding-horizon behavior along the sequence of contact events.
Electronics 14 04837 g001
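To make the mixture sampling step concrete, the following minimal Python sketch shows how a completeness-preserving proposal might look inside the tree-extension loop. The names propose_extension, diffusion_sampler, p_uni, and s_max are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def propose_extension(q_near, cond, diffusion_sampler, rng,
                      p_uni=0.2, s_max=0.1, n_joints=7):
    """Propose a tree-extension endpoint from q_near.

    With probability p_uni a uniform segment is drawn, which preserves
    probabilistic completeness; otherwise the conditional diffusion model
    denoises a latent x_T -> x_0 and decodes it into a joint-space
    direction u and step length s.
    """
    if rng.random() < p_uni:
        # Uniform proposal: random direction, random step length.
        u = rng.standard_normal(n_joints)
        s = rng.uniform(0.0, s_max)
    else:
        # Learned proposal conditioned on (I, q_curr, q_goal, q_toggle, g_toggle).
        u, s = diffusion_sampler(q_near, cond)
    u = u / (np.linalg.norm(u) + 1e-9)  # normalize direction
    s = float(np.clip(s, 0.0, s_max))   # clip step length
    return q_near + s * u               # endpoint handed to steering/collision checks
```

Because the proposal only replaces the sample source, nearest-neighbor search, steering, and collision checking remain the stock RRT primitives.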
Figure 2. Stages of the proposed planner with toggle/subgoal guidance. (a) Move toward the rim grasp toggle point (toggle in green; subgoal shown as a blue cross). (b) After grasping the rim, move toward the release toggle point (toggle in magenta). Translucent blue dots depict only the learned sample distribution (endpoints of candidate tree-extension segments).
Figure 3. Single episode with toggle–subgoal guidance. Panels (a–j) track the sequence from the first grasp to final placement; blue dots show only learned proposal samples (uniform proposals omitted). (a) Move toward the first grasp; the upper network predicts a rim toggle q_toggle^pred (green) and a subgoal q_goal^pred. (b) Similar to (a), with q_goal^pred placed closer to q_toggle^pred. (c) After the first grasp, begin lifting and reorienting the mug. (d) Move toward the release toggle; the upper network predicts q_toggle^pred with g_toggle^pred = open (magenta), and here q_goal^pred ≈ q_toggle^pred. (e) Release to upright the mug. (f) Re-approach to grasp the handle. (g) Similar to (f), with q_goal^pred placed nearer to the handle toggle to refine the approach. (h) Second grasp on the handle. (i) Move toward the terminal region Q_term. (j) Final placement inside Q_term (success).
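The phase-wise behavior traced in panels (a–j) can be summarized in a short sketch; env, upper_net, and plan_rrt below are hypothetical interfaces introduced only for illustration of the re-query loop, under the assumption that configurations are NumPy arrays.

```python
import numpy as np

def run_episode(env, upper_net, plan_rrt, max_phases=12, toggle_tol=0.05):
    """Phase-wise receding-horizon loop (illustrative API).

    Each phase: query the upper network for (q_toggle, g_toggle, q_goal),
    plan a short segment toward q_goal with the diffusion-guided RRT,
    execute it (toggling the gripper near q_toggle), then re-query.
    """
    for _ in range(max_phases):
        I, q_curr, g_curr = env.observe()
        q_toggle, g_toggle, q_goal = upper_net(I, q_curr, g_curr)
        path = plan_rrt(q_curr, q_goal,
                        cond=(I, q_curr, q_goal, q_toggle, g_toggle))
        if path is None:
            return False  # planning failed in this phase
        for q in path:
            env.move_to(q)
            # Toggle the gripper when passing close to the predicted toggle.
            if np.linalg.norm(q - q_toggle) < toggle_tol:
                env.set_gripper(g_toggle)
        if env.in_terminal_region():
            return True   # reached Q_term
    return False
```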
Figure 4. Three episodes under different initial mug poses. Columns show task phases (pre-grasp (rim) → release → second grasp (handle) → terminal); rows are different randomized initializations. Blue dots show only learned proposal samples. Across scenes, the sampler stays phase-appropriate while adapting to pose and viewpoint.
Figure 5. Qualitative comparison of learned sample distributions. Columns (left → right): Diffusion model (ours), CVAE, Conditional GAN. Rows: distinct scenes/stages around contact. Blue = endpoints from each model's learned distribution (uniform not shown). Green dashed = representative task-irrelevant samples common in CVAE/CGAN (e.g., wide joint-space sweeps that bypass the narrow approach corridors needed to reach the rim or handle).
Table 1. Comparison 1 (success over 300 episodes). Imitation learning (IL) baselines are no-search policies; to keep evaluation closed-loop and mitigate long-horizon open-loop drift, each policy executes only a short prefix of its predicted sequence before re-planning from new observations. All methods share the same perception setup and task specification.

| Method | Success (%) |
| --- | --- |
| Diffusion Policy (IL) | 89.33 |
| LSTM–GMM (IL) | 69.67 |
| Transformer (IL) | 60.33 |
| Ours (Diffusion Sampler + Toggle/Subgoal) | 95.67 |
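The short-prefix evaluation protocol for the IL baselines in Table 1 amounts to the loop below; rollout_closed_loop, env, and policy are hypothetical names used only to illustrate the protocol, not the evaluation harness itself.

```python
def rollout_closed_loop(env, policy, horizon=200, prefix_len=8):
    """Closed-loop evaluation of a sequence policy (illustrative sketch).

    The policy predicts a full action sequence, but only a short prefix
    is executed before re-planning from fresh observations, which limits
    open-loop drift over long horizons.
    """
    for _ in range(0, horizon, prefix_len):
        obs = env.observe()
        actions = policy(obs)             # full predicted action sequence
        for a in actions[:prefix_len]:    # execute only a short prefix
            env.step(a)
        if env.task_success():
            return True
    return False
```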
Table 2. Comparison 2 (efficiency within the RRT family). Expansions are normalized to our method (lower is better). Values are averaged over successful episodes only. All learned samplers share identical conditioning signals and the same p_uni. Our approach achieves the lowest normalized expansion count under identical conditioning and mixture settings.

| RRT Family Method | Normalized RRT Expansions |
| --- | --- |
| Ours (Diffusion Sampler + Toggle/Subgoal) | 1.00 |
| Uniform RRT | 2.75 |
| Goal-biased RRT (bias 0.5 toward q_goal) | 1.73 |
| CVAE Sampler + RRT | 1.36 |
| Cond. GAN Sampler + RRT | 1.67 |
Table 3. Average inference time per learned proposal (planner overhead excluded). Times are normalized so that DDIM (T = 25) = 1.00 and averaged over 500 proposals per method.

| Method | Normalized Time |
| --- | --- |
| Ours (Diffusion, DDIM T = 25) | 1.00 |
| Ours (Diffusion, DDIM T = 10) | 0.42 |
| CVAE sampler | 0.09 |
| Conditional-GAN sampler | 0.11 |
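Per-proposal latency is dominated by the T denoiser evaluations, so reducing T from 25 to 10 cuts time roughly proportionally, consistent with the 0.42 entry above. A minimal deterministic DDIM loop (η = 0) illustrating this is sketched below; eps_model (the trained noise predictor) and alpha_bar (the cumulative noise schedule) are assumed inputs, and the sketch is generic rather than the authors' sampler.

```python
import numpy as np

def ddim_sample(eps_model, cond, alpha_bar, T=25, dim=8, rng=None):
    """Deterministic DDIM sampling (eta = 0) with T denoising steps.

    Cost is dominated by the T calls to eps_model, so latency scales
    roughly linearly in T.
    """
    rng = rng or np.random.default_rng()
    # Evenly spaced, descending sub-sequence of the training timesteps.
    steps = np.linspace(len(alpha_bar) - 1, 0, T).round().astype(int)
    x = rng.standard_normal(dim)  # x_T ~ N(0, I)
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < T else 1.0
        eps = eps_model(x, t, cond)                          # predicted noise
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean sample
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # DDIM update
    return x  # decoded downstream into an extension segment (u, s)
```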
Table 4. End-to-end, episode-level TTFS (time to first solution). Sum of planning time across all phases (planner overhead included), averaged over 500 successful episodes. Each column is independently normalized so that DDIM (T = 25) equals 1.00; values are not comparable across columns.

| Method | p_uni = 0.2 | p_uni = 0.6 |
| --- | --- | --- |
| Ours (Diffusion, DDIM T = 25) | 1.00 | 1.00 |
| Ours (Diffusion, DDIM T = 10) | 0.45 | 0.49 |
| CVAE sampler | 0.16 | 0.24 |
| Conditional-GAN sampler | 0.20 | 0.27 |
Table 5. Robustness to viewpoint and illumination (50 episodes per setting). α: per-axis yaw/pitch bound; τ: per-axis translation bound; σ: std. of additive image noise. Toggle err is the mean joint-space distance (in radians) between the predicted toggle configuration and the nearest ground-truth grasp/release toggle.

| Perturbation | α (deg) | τ (m) | σ | Success (%) | Toggle Err (rad) |
| --- | --- | --- | --- | --- | --- |
| Nominal | 0 | 0.00 | 0.000 | 96 | 0.0112 |
| Viewpoint (mild) | 5 | 0.01 | 0.000 | 88 | 0.0201 |
| Viewpoint (strong) | 10 | 0.02 | 0.000 | 82 | 0.0285 |
| Photometric | 0 | 0.00 | 0.020 | 90 | 0.0197 |
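The toggle-error column reduces to a nearest-neighbor distance in joint space; a one-function sketch follows (illustrative only, not the evaluation code), assuming predicted and ground-truth toggles are joint-configuration arrays.

```python
import numpy as np

def toggle_error(q_pred, q_true_toggles):
    """Joint-space distance (rad) from the predicted toggle configuration
    to the nearest ground-truth grasp/release toggle; averaging these
    per-episode values yields the Toggle Err column of Table 5."""
    return min(float(np.linalg.norm(q_pred - q_gt)) for q_gt in q_true_toggles)
```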