DiT1dLnet: A Fast and Accurate Diffusion Model Structure Based on Robot Behavior Imitation

Liao, Jiaxin; He, Weiyuan; Yu, Qing; Chen, Fei

doi:10.3390/math14111785

¹ College of Computer Science & Technology, Qingdao University, Qingdao 266071, China

² College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2026, 14(11), 1785; https://doi.org/10.3390/math14111785

Submission received: 2 March 2026 / Revised: 14 May 2026 / Accepted: 18 May 2026 / Published: 22 May 2026

(This article belongs to the Special Issue Advances in Artificial Intelligence, Machine Learning and Optimization, 2nd Edition)

Abstract

A novel robot behavior generation method combining imitation learning with diffusion models elegantly addresses multi-modal action distributions, adapts to high-dimensional action spaces, and demonstrates impressive training stability. It significantly improves success rates across nine diverse tasks on three different robot simulation benchmarks, but comes with longer training times and slower inference speed. This paper proposes a novel architecture, DiT1dLnet, applied to DDPM for training and inference. DiT1dLnet improves accuracy across various robotic simulation tasks while accelerating training and inference speed by 50–100%. We benchmarked its performance on nine different tasks using three distinct robots.

Keywords:

imitation learning; diffusion model; robot; manipulation; neural network

MSC:

68T07

1. Introduction

Imitation learning (IL) is a supervised learning paradigm that enables agents to acquire policies directly from expert demonstrations [1,2,3]. Owing to its ability to avoid costly online exploration, IL has been widely applied in practical domains such as gaming [4], robotics [5], and autonomous driving [6,7]. Among existing IL approaches, behavior cloning has demonstrated strong effectiveness in relatively structured settings. However, when tasks involve multimodal action distributions, long-horizon dependencies, and high precision control requirements, conventional imitation learning methods often struggle to capture the full diversity and authenticity of expert behaviors [8,9,10,11]. This limitation is especially significant in robotic manipulation, where different valid trajectories may exist for the same task objective and where action quality directly influences task success.

To address these challenges, recent studies have introduced diffusion models into robot behavior generation and decision making, leading to a new class of diffusion-based imitation learning and offline policy learning methods, such as DiffusionPolicy [12], DiffusionBC [13], DiffuserLite [14], DiffusionVeteran [15], QGPO [16], and SfBC [17]. These methods show strong capability in modeling multimodal action distributions and improving policy expressiveness compared with traditional imitation learning and reinforcement learning approaches [18,19,20]. Diffusion-based policies can generate diverse action sequences through iterative denoising, which makes them especially suitable for robotic tasks requiring flexible decision-making under uncertainty. Nevertheless, these advantages are often obtained at the cost of increased training and inference time, which limits their practical deployment in robotic systems requiring efficient policy execution.

This trade-off between policy performance and computational efficiency motivates the present study. Existing CNN-based diffusion architectures, such as U-Net-style denoisers, provide strong local modeling ability and stable optimization, but they may suffer from structural redundancy and limited efficiency in modeling long-range temporal dependencies [12]. Transformer-based diffusion architectures, on the other hand, benefit from global receptive fields and superior sequence modeling ability [21] but may weaken local inductive biases that are important for fine-grained action refinement in robot control. Therefore, an important open question is whether a diffusion architecture can simultaneously preserve the accuracy advantages of CNN-based denoisers and exploit the global modeling efficiency of transformer-based designs. Solving this problem is highly relevant for robot imitation learning because real robotic systems require both accurate action generation and efficient inference under complex multimodal behavior distributions.

To this end, we propose DiT1dLnet, a novel diffusion model architecture for robot behavior imitation. The main contributions of this work can be summarized in three aspects. First, we introduce a hybrid denoising architecture that combines stacked 1D Diffusion Transformer (DiT) blocks [21] with a CNN-based decoder inspired by diffusion-policy-style U-Net structures [12], thereby integrating global temporal modeling with local action refinement. Second, we design a triple residual connection mechanism, which injects multi-scale action embeddings into different decoding stages to improve feature preservation and gradient propagation. Third, we retain FiLM-based observation conditioning throughout the network, enabling visual and state observations to effectively guide long-horizon action generation while maintaining stable conditional diffusion learning. In contrast to existing methods that rely solely on either CNN or transformer backbones, our method provides a more balanced architectural solution to the accuracy-efficiency trade-off in robotic imitation learning.

The purpose of this study is to develop a fast and accurate diffusion policy architecture for robot manipulation that improves both practical efficiency and control performance. To validate this objective, we evaluated DiT1dLnet on 9 task variants across 3 representative robotic benchmarks, including Robomimic [5], Push-T [9], and Franka Kitchen [22]. These benchmarks were selected to cover complementary aspects of robotic manipulation. Robomimic focuses on grasping, fine-grained manipulation, long-horizon multi-stage execution, and multi-robot coordination, while also evaluating robustness to heterogeneous and suboptimal human demonstrations. Push-T tests the ability to handle complex, contact-rich object dynamics through precise point-contact pushing of a T-shaped block. Franka Kitchen evaluates multi-task coordination and long-horizon behavior synthesis, with the objective of completing as many demonstrated tasks as possible, regardless of order, thereby capturing both short-horizon and long-horizon multimodality.

Overall, this study suggests that architectural design plays an important role in diffusion-based robot policy learning. By combining transformer-based global sequence modeling, CNN-based local refinement, and enhanced residual information flow, DiT1dLnet provides an efficient and accurate solution for robot behavior imitation. More broadly, this work offers a useful example of how hybrid generative architectures can be designed for robotic control tasks.

2. Related Work

2.1. Diffusion Model

Diffusion models [23,24,25,26,27] are generative models that map Gaussian noise to some target distribution in an iterative fashion, optionally conditioned on some context [28,29].

A diffusion model is composed of two interconnected processes: the forward diffusion process and the reverse denoising process.

Forward Diffusion Process: Starting with a clean data sample $x_0$ from the real distribution $q(x)$, the forward process $q$ introduces Gaussian noise over $K$ timesteps, following a variance schedule $\{\beta_1, \dots, \beta_K\}$. Each step is a Gaussian transition:

$$q(x_k \mid x_{k-1}) = \mathcal{N}\left(x_k; \sqrt{1 - \beta_k}\, x_{k-1}, \beta_k I\right) \tag{1}$$

By defining $\alpha_k = 1 - \beta_k$ and $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$, we can directly express $x_k$ as a function of $x_0$:

$$x_k = \sqrt{\bar{\alpha}_k}\, x_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{2}$$

This closed-form sampling is crucial for training the noise prediction network, as it allows the creation of noisy training pairs $(x_k, \epsilon)$ without iterative sampling.
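The closed-form noising in Equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the linear schedule, its endpoints, and the names `make_schedule` and `q_sample` are assumptions introduced here.

```python
import numpy as np

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas                 # alpha_k = 1 - beta_k
    alpha_bars = np.cumprod(alphas)      # alpha_bar_k = prod_{i<=k} alpha_i
    return betas, alphas, alpha_bars

def q_sample(x0, k, alpha_bars, rng):
    """Draw x_k directly from x_0 via Equation (2); k is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[k - 1]
    x_k = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_k, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((16, 2))        # hypothetical clean sample
x_k, eps = q_sample(x0, k=50, alpha_bars=alpha_bars, rng=rng)
```

The pair `(x_k, eps)` is exactly the kind of training pair the text refers to: no loop over the intermediate steps $1, \dots, k-1$ is needed.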

Reverse Denoising Process: In contrast, the reverse (generation) process begins with a sample $x_K \sim \mathcal{N}(0, I)$ drawn from a standard Gaussian distribution. The model learns to reverse the forward process by estimating the noise at each step, so the sample is iteratively refined over $K$ steps. In each step $k$ (from $K$ down to $1$), the model denoises the current sample $x_k$ to produce a less noisy version $x_{k-1}$. This sequence of refinements, $x_K, x_{K-1}, \dots, x_0$, continues until a final, noise-free output $x_0$ is obtained. Each denoising step is governed by the following equation:

$$x_{k-1} = \alpha \left(x_k - \gamma\, \epsilon_\theta(x_k, k)\right) + \mathcal{N}(0, \sigma^2 I) \tag{3}$$

where $\alpha$, $\gamma$, and $\sigma$ denote coefficients determined by the diffusion discretization and variance schedule.
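As one concrete instance of Equation (3), the standard DDPM sampler uses $\alpha = 1/\sqrt{\alpha_k}$, $\gamma = \beta_k / \sqrt{1 - \bar{\alpha}_k}$, and $\sigma^2 = \beta_k$. The sketch below assumes this instantiation and a linear schedule. The text does not state which coefficients the authors use, and the function name `ddpm_reverse_step` is hypothetical.

```python
import numpy as np

# Assumed linear schedule; the paper does not specify one.
betas = np.linspace(1e-4, 0.02, 100)
alpha_bars = np.cumprod(1.0 - betas)

def ddpm_reverse_step(x_k, k, eps_pred, betas, alpha_bars, rng):
    """One reverse step x_k -> x_{k-1} with standard DDPM coefficients (k is 1-indexed)."""
    beta_k = betas[k - 1]
    scale = 1.0 / np.sqrt(1.0 - beta_k)                 # "alpha" in Equation (3)
    gamma = beta_k / np.sqrt(1.0 - alpha_bars[k - 1])   # "gamma" in Equation (3)
    mean = scale * (x_k - gamma * eps_pred)
    if k == 1:
        return mean                                     # no noise is added at the last step
    sigma = np.sqrt(beta_k)                             # one common variance choice
    return mean + sigma * rng.standard_normal(x_k.shape)
```

With a perfect noise prediction at $k = 1$, the step returns the clean sample exactly, which is a convenient sanity check for the coefficients.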

The noise prediction network, denoted as $\epsilon_\theta$, is trained to predict the noise added to a clean data sample. For each training step, we take a clean sample $x_0$ from the dataset and select a random noise level $k$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$. Rather than adding noise step by step, we use the closed-form expression in Equation (2) to construct the noisy sample $x_k$. In this case, $x_k$ is a linear combination of the original signal and noise, with the mixing ratio determined by $\bar{\alpha}_k$. The network $\epsilon_\theta$ takes this noisy sample $x_k$ as input and aims to recover the added noise $\epsilon$ as accurately as possible. The objective function for training typically uses Mean Squared Error (MSE) loss:

$$L = \mathbb{E}_{x_0, \epsilon, k}\left[\left\| \epsilon - \epsilon_\theta(x_k, k) \right\|_2^2\right] \tag{4}$$

In policy learning for robots, we replace the diffusion variable $x$ from the image domain with the action (or action sequence) $A$. Here, the subscript denotes the environment interaction time step $\tau$, and the superscript denotes the diffusion denoising step $t$. For example, at environment time step $\tau$, the policy generates an action sequence $A_\tau \in \mathbb{R}^{H \times d}$ of length $H$ based on the observation $o_\tau$ ($d$ is the action dimension). The diffusion process is carried out over the action sequence tensor, yielding a series of denoised intermediate variables $\{A_\tau^t\}_{t=T}^{0} = A_\tau^T, A_\tau^{T-1}, \dots, A_\tau^1, A_\tau^0$. Note that the diffusion step $t$ represents the denoising iteration index and is different from the environment time step $\tau$. This action-domain formulation makes the diffusion process directly compatible with robot policy learning [30], since the model denoises action sequences conditioned on observations rather than image pixels.

Forward noising (action domain): Given a “clean action sequence” $A_\tau^0$ (corresponding to the expert action/supervision signal), a noised action is constructed at a randomly sampled diffusion step $t$:

$$A_\tau^t = \sqrt{\bar{\alpha}_t}\, A_\tau^0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{5}$$

where $\epsilon$ has the same shape as $A_\tau^0$. In the proposed framework, the forward diffusion process is not a physical process executed by the robot. Instead, it serves as a training mechanism for constructing noisy action samples from expert action sequences. By learning to recover the clean action sequence from these corrupted samples, the model acquires the ability to generate feasible robot actions under observation guidance.

Conditional denoising and training objective: To incorporate observational context, the denoising network is modeled as a conditional model $\epsilon_\theta(A_\tau^t, t, o_\tau)$. Its inputs are the noised action $A_\tau^t$, the diffusion step $t$, and the observation condition $o_\tau$, and its output is the estimated noise. Training uses the mean squared error of noise prediction:

$$L(\theta) = \mathbb{E}_{\tau, A_\tau^0, o_\tau, t, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(A_\tau^t, t, o_\tau) \right\|_2^2\right] \tag{6}$$
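A single-batch Monte Carlo estimate of the objective in Equation (6), built on the noising rule of Equation (5), might look as follows. This is a sketch under stated assumptions: `eps_theta` is any callable with the signature `(noisy_actions, t, obs)`, the zero predictor is a hypothetical placeholder for the real network, and the schedule and tensor shapes are illustrative only.

```python
import numpy as np

def diffusion_loss(eps_theta, actions, obs, alpha_bars, rng):
    """Estimate Equation (6) on one batch of clean action sequences of shape (B, H, d)."""
    batch = actions.shape[0]
    num_steps = len(alpha_bars)
    t = rng.integers(1, num_steps + 1, size=batch)       # one diffusion step per sample
    a_bar = alpha_bars[t - 1][:, None, None]             # broadcast over (H, d)
    eps = rng.standard_normal(actions.shape)
    noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps   # Equation (5)
    pred = eps_theta(noisy, t, obs)
    # Squared L2 norm per sample, averaged over the batch.
    return float(np.mean(np.sum((eps - pred) ** 2, axis=(1, 2))))

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))   # assumed schedule
actions = rng.standard_normal((4, 16, 2))                     # hypothetical expert chunks
obs = rng.standard_normal((4, 5))                             # hypothetical observations

def zero_predictor(noisy, t, o):
    """Placeholder network that always predicts zero noise."""
    return np.zeros_like(noisy)

loss = diffusion_loss(zero_predictor, actions, obs, alpha_bars, rng)
```

In practice `eps_theta` would be the trainable denoiser and the loss would be minimized with a gradient-based optimizer; the sketch only shows how the training target is formed.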

Therefore, the above action-domain diffusion formulation is not introduced as an isolated generative process, but as the mathematical foundation of the proposed observation-conditioned robot policy learning framework.

This objective trains the proposed DiT1dLnet to learn the reverse diffusion process, i.e., to iteratively remove noise from action samples and approximate the conditional action distribution for robot policy learning.

Inference/generation (action domain): During inference, conditioned on a given observation $o_\tau$, the process starts from Gaussian noise initialized as $A_\tau^T \sim \mathcal{N}(0, I)$ and iteratively denoises it to obtain the action sequence. First, the noise prediction is used to estimate the clean action:

$$\hat{A}_\tau^0 = \frac{A_\tau^t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(A_\tau^t, t, o_\tau)}{\sqrt{\bar{\alpha}_t}} \tag{7}$$

Then, following the discretized update of probability flow ODE [24]/DDIM [25], the process iterates from $t = T$ down to $t = 1$, yielding the final action sequence $A_\tau^0$ (or its estimate). In closed-loop control, a receding-horizon execution scheme can be adopted: at each environment time step, an action chunk $\hat{A}_\tau^0$ is generated, only the first few actions in the predicted action chunk are executed, and then replanning is performed in a rolling manner to account for real-time feedback and maintain temporal consistency. In this way, the diffusion formulation is integrated into the full robot policy learning pipeline: the forward process provides noisy action training pairs, the proposed DiT1dLnet learns the reverse conditional denoising process for observation-guided action generation, and rolling replanning enables closed-loop execution under real-time feedback.
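The inference procedure can be sketched as the clean-action estimate of Equation (7) followed by a deterministic DDIM-style update. The text does not spell the update rule out, so the form $A^{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{A}^0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta$ with $\bar{\alpha}_0 = 1$ is an assumption here, as are the schedule and all function names.

```python
import numpy as np

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))   # assumed schedule

def predict_clean_action(a_t, t, eps_pred, alpha_bars):
    """Equation (7): estimate the clean action chunk from the noisy one (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    return (a_t - np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a_bar)

def ddim_step(a_t, t, eps_pred, alpha_bars):
    """Deterministic DDIM-style update from step t to step t-1 (assumed form)."""
    a0_hat = predict_clean_action(a_t, t, eps_pred, alpha_bars)
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0          # alpha_bar_0 = 1
    return np.sqrt(a_bar_prev) * a0_hat + np.sqrt(1.0 - a_bar_prev) * eps_pred

def sample_actions(eps_theta, obs, horizon, action_dim, alpha_bars, rng):
    """Start from Gaussian noise and denoise from t = T down to t = 1."""
    a_t = rng.standard_normal((horizon, action_dim))
    for t in range(len(alpha_bars), 0, -1):
        a_t = ddim_step(a_t, t, eps_theta(a_t, t, obs), alpha_bars)
    return a_t

def zero_eps(a_t, t, obs):
    """Hypothetical placeholder for the trained denoiser."""
    return np.zeros_like(a_t)

rng = np.random.default_rng(0)
actions = sample_actions(zero_eps, obs=None, horizon=16, action_dim=2,
                         alpha_bars=alpha_bars, rng=rng)
```

If the noise prediction is exact, `predict_clean_action` inverts Equation (5) and returns the original clean chunk, which is why the noise-prediction objective is sufficient for generation.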

During inference, the iterative denoising process follows the discretized update as shown in Equation (7). Crucially, the noise prediction network $\epsilon_\theta$ in this formulation represents our proposed DiT1dLnet. Therefore, the mathematical framework established in this section serves as the theoretical foundation for the action generation process of the entire DiT1dLnet architecture.

2.2. Representative Denoising Architectures

In this section, we delineate five representative denoising network architectures. These structures are summarized in Figure 1 and serve as the structural background for the baseline methods included in the experimental comparisons reported in Table 1. On this basis, the denoising architecture proposed in this work is presented separately in Figure 2.

This section describes neural network architectures for observation-to-action diffusion models. Concretely, at time step $t$, the policy takes the latest $T_o$ steps of observation data $O_t$ as input and predicts $T_p$ steps of actions, of which $T_a$ steps are executed on the robot without re-planning. Here, we define $T_o$ as the observation horizon, $T_p$ as the action prediction horizon, and $T_a$ as the action execution horizon. The following are the structural types of denoising networks.
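The interplay of the three horizons can be illustrated with a small receding-horizon loop. Everything here is a hypothetical stand-in: `ToyEnv` is a two-dimensional integrator, `toy_policy` replaces the diffusion policy with a simple rule, and padding the initial history with the first observation is an assumption made for this sketch.

```python
from collections import deque
import numpy as np

class ToyEnv:
    """Hypothetical 2D integrator: the observation is the state itself."""
    def __init__(self, start):
        self.state = np.asarray(start, dtype=float)

    def step(self, action):
        self.state = self.state + action
        return self.state.copy()

def toy_policy(obs_history, T_p=16):
    """Stand-in for a diffusion policy: predict T_p identical actions toward the origin."""
    latest = obs_history[-1]
    return np.tile(-0.1 * latest, (T_p, 1))

def run_receding_horizon(policy, step_env, first_obs, T_o, T_a, max_steps):
    """Plan T_p actions from the last T_o observations, execute T_a, then re-plan."""
    history = deque([first_obs] * T_o, maxlen=T_o)   # assumed padding of the history
    num_plans, steps = 0, 0
    while steps < max_steps:
        plan = policy(np.stack(history))             # shape (T_p, action_dim)
        num_plans += 1
        for action in plan[:T_a]:                    # execute only the first T_a actions
            history.append(step_env(action))
            steps += 1
            if steps >= max_steps:
                break
    return num_plans, np.stack(history)

env = ToyEnv([1.0, -2.0])
num_plans, history = run_receding_horizon(
    toy_policy, env.step, env.state.copy(), T_o=2, T_a=8, max_steps=20)
```

With $T_a = 8$ and 20 environment steps, the policy is queried three times; a smaller $T_a$ re-plans more often at the cost of more denoising passes.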

2.2.1. MLP Sieve

This architecture [13] uses three encoding networks to produce embeddings of the observation, denoising timestep, and action: $o^e, t^e, a_{\tau-1}^e \in \mathbb{R}^{\text{embed dim}}$. These are concatenated together as input to a denoising network:

$$[o^e, t^e, a_{\tau-2}^e] = \mathrm{Linear}\left(\mathrm{Cat}(o^e, t^e, a_{\tau-1}^e)\right) + \mathrm{Residual}\left(\mathrm{Cat}(o^e, t^e, a_{\tau-1}^e)\right) \tag{8}$$

Here, σ denotes instance normalization, and N denotes the number of multi-head attention heads.

The denoising network is a fully connected architecture with residual skip connections, in which the raw denoising timestep $\tau$ and action $a_{\tau-1}$ are repeatedly concatenated after each hidden layer. To include a longer observation history, previous observations are passed through the same embedding network, and the embeddings are concatenated together.

This architecture uses separate encoders for the observation, diffusion timestep, and action, and concatenates these embeddings as input to a fully connected denoising network. As a representative denoising structure in the DiffusionBC framework, it provides a simple baseline for conditional action denoising. However, due to its limited capacity for global temporal modeling, it is less suitable for complex long-horizon robotic manipulation tasks than transformer- or CNN-based alternatives.

2.2.2. Transformer-Encoder Only

This approach generates embeddings following the same procedure as MLP Sieve. A multi-headed attention mechanism [32] (commonly employed in contemporary transformer encoder architectures) is subsequently utilized as the denoising network. A minimum of three tokens serve as input, $o^e, t^e, a_{\tau-1}^e \in \mathbb{R}^{\text{embed dim}}$, and this configuration can be expanded to accommodate extended observation histories (given that only the current $t^e, a_{\tau-1}^e$ are required due to the Markovian nature of the diffusion process):

$$[o^e, t^e, a_{\tau-1}^e] = \mathrm{Attention}(Q_{\mathrm{Input}}, K_{\mathrm{Input}}, V_{\mathrm{Input}}) \tag{9}$$

$$\mathrm{Attention}(Q_{\mathrm{Input}}, K_{\mathrm{Input}}, V_{\mathrm{Input}}) = \mathrm{softmax}\left(\frac{Q_{\mathrm{Input}} K_{\mathrm{Input}}^{\top}}{\sqrt{d_k}}\right) V_{\mathrm{Input}} \tag{10}$$

Here, the input denotes $[o^e, t^e, a_{\tau-1}^e]$, and $Q_{\mathrm{Input}}, K_{\mathrm{Input}}, V_{\mathrm{Input}}$ denote the query, key, and value projections of $[o^e, t^e, a_{\tau-1}^e]$.
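Equation (10) is the standard scaled dot-product attention. A single-head NumPy sketch over the three input tokens is given below; the projection matrices are random placeholders, and the embedding size of 8 is chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (num_tokens, dim) array."""
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # (num_tokens, num_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
embed_dim = 8                                     # illustrative size
tokens = rng.standard_normal((3, embed_dim))      # stand-ins for o^e, t^e, a^e
W_q, W_k, W_v = (rng.standard_normal((embed_dim, embed_dim)) for _ in range(3))
out, weights = self_attention(tokens, W_q, W_k, W_v)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors of all input tokens. This is the source of the global receptive field mentioned for transformer-based denoisers.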

This transformer-encoder structure serves as the denoising backbone in the ActionCT framework. By leveraging multi-head self-attention, it provides a global receptive field and is therefore well suited to robotic manipulation tasks that involve long-range temporal dependencies and complex correlations across action steps.

2.2.3. Time-Series Diffusion Transformer

This architecture is a transformer-based DDPM that employs the transformer architecture from miniGPT [33] for action prediction. Noisy actions $A_t^k$ are fed as input tokens to the transformer decoder blocks, with the sinusoidal embedding for diffusion iteration $k$ prepended as the initial token.

$$\mathrm{ActionSeq}(A_{t+1}) = \mathrm{crossAttention}\left(\mathrm{FiLM}(\mathrm{obs}), \mathrm{ActionSeq}(A_t)\right) \tag{11}$$

The observation $O_t$ is converted into observation embedding sequences through a shifted MLP, and these sequences are subsequently fed into the transformer decoder stack as input features. The gradient $\epsilon_\theta(O_t, A_t^k, k)$ is estimated by each respective output token of the decoder stack.

This time-series diffusion transformer was introduced in DiffusionBC as a stronger alternative to the MLP-based denoiser. By treating noisy actions as sequential tokens and applying transformer decoder blocks, it improves global sequence modeling for action generation. However, because its design emphasizes temporal modeling more than local refinement, it may be less effective in manipulation tasks that require precise geometric control.

2.2.4. U-Net

The whole network uses U-Net [34] as the backbone architecture, adapting the classical encoder–decoder structure originally designed for image segmentation to robotic action generation tasks [35]. The encoder path progressively downsamples the input through convolutional blocks, while the decoder path symmetrically upsamples the features. Skip connections between encoder–decoder layers preserve spatial information that would otherwise be lost during the downsampling process. For robotic simulation applications, we modify the traditional U-Net by incorporating domain-specific components. Each feature is extracted through a ResidualBlock, where the block contains two convolutional layers with residual connections to mitigate vanishing gradients. To model the conditional distribution $p(A_t \mid O_t)$, we condition the action generation process on observation features $O_t$ by integrating Feature-wise Linear Modulation (FiLM) throughout the network. The FiLM mechanism applies element-wise affine transformations to intermediate feature maps, where scaling and shifting parameters are dynamically generated from observation embeddings. It receives the parameter pair $(\gamma, \beta)$ generated from the observation feature $O_t$ and applies them to the intermediate feature map $h$:

$$\mathrm{FiLM}(h; \gamma, \beta) = \gamma(O_t) \odot h + \beta(O_t) \tag{12}$$
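Equation (12) amounts to a per-channel scale and shift whose parameters are functions of the observation. A minimal NumPy sketch follows, assuming that $\gamma$ and $\beta$ are produced by linear layers (an assumption; the text only says they are generated from observation embeddings) and that the feature map has shape (channels, time).

```python
import numpy as np

def film(h, obs_feat, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: scale and shift each channel of h using parameters predicted from obs_feat.

    h:        (channels, time) intermediate feature map
    obs_feat: (obs_dim,) observation embedding
    """
    gamma = obs_feat @ W_gamma + b_gamma          # (channels,)
    beta = obs_feat @ W_beta + b_beta             # (channels,)
    return gamma[:, None] * h + beta[:, None]     # broadcast over the time axis

rng = np.random.default_rng(0)
channels, time_steps, obs_dim = 4, 16, 6          # illustrative sizes
h = rng.standard_normal((channels, time_steps))
obs_feat = rng.standard_normal(obs_dim)
W_gamma = rng.standard_normal((obs_dim, channels))
W_beta = rng.standard_normal((obs_dim, channels))
out = film(h, obs_feat, W_gamma, np.ones(channels), W_beta, np.zeros(channels))
```

Setting the weight matrices to zero with `b_gamma = 1` and `b_beta = 0` reduces FiLM to the identity, which shows that the conditioning acts as a learned perturbation of the unconditioned features.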

In our comparative study, this U-Net-based architecture serves as the denoising backbone for the DiffusionPolicy baseline. Its convolutional structure provides strong local inductive bias and stable denoising behavior, which are beneficial for robotic manipulation tasks requiring precise local control. At the same time, hierarchical CNN-based designs may introduce higher computational cost, motivating the use of a more streamlined hybrid architecture in this work.

2.2.5. DiT1d

The module inherits from the Diffusion Transformer (DiT) network backbone proposed in [21]. The DiT architecture uses transformer blocks that employ multi-head self-attention and adaptive layer normalization, where diffusion timestep and conditioning information are incorporated through adaptive normalization layers. In the context of robot policy learning, the noisy action sequence can be represented as a 1D token sequence, allowing the transformer to operate along the temporal dimension rather than over image patches.

Such a transformer-based denoising structure is advantageous for modeling dependencies across multiple action steps, since self-attention enables direct interaction among tokens throughout the sequence. This makes the architecture particularly suitable for observation-conditioned action denoising, where temporally correlated actions must be generated in a coherent manner. In our framework, the DiT1d module provides the global temporal modeling component of the denoising network, while the decoder preserves local refinement capability. Additional architectural details are introduced in Section 3.2.

3. Key Design

Based on the representative denoising structures summarized in Figure 1, we design the denoising architecture adopted in this work, as shown in Figure 2.

The first design decision is the choice of neural network architecture for $\epsilon_\theta$. In this work, we examine several common types of network architecture, namely MLP Sieve, Transformer-Encoder Only, DiT1d, the CNN-based DiffusionPolicy, and the time-series diffusion transformer, and compare their performance and training characteristics.

3.1. DiT1dLnet

We propose a diffusion-based action denoising network for robot policy learning. First, we replace the encoder of the traditional U-Net with N stacked DiT Blocks, where observation features condition the action embedding through FiLM modulation. Second, the decoder consists of cascaded ChiResidualBlocks with progressively increasing feature dimensions ($d/8 \to d/4 \to d/2 \to d$), where each stage contains N residual blocks. Third, we introduce Triple Residual connections as a key novelty: the input action sequence generates three auxiliary embeddings $E_{d/8}, E_{d/4}, E_{d/2}$ at different dimensional scales, which are added to the corresponding decoder stages via skip connections:

$$H_0 = \mathrm{DiTBlock}\left(\mathrm{ActionSeq}(A_t), \mathrm{FiLM}(\mathrm{ObsSeq}(O_t))\right) \tag{13}$$

$$H_1 = \mathrm{ChiResidualBlock}(H_0) + E_{d/8} \tag{14}$$

$$H_2 = \mathrm{ChiResidualBlock}(H_1) + E_{d/4} \tag{15}$$

$$H_3 = \mathrm{ChiResidualBlock}(H_2) + E_{d/2} \tag{16}$$

$$\nabla A = \mathrm{Conv1d}(H_3) \tag{17}$$

The final output is obtained through the projection in Equation (17), representing the action sequence after one-step denoising. The DiT Block structure with its attention mechanism mitigates the excessive smoothing effect commonly observed in traditional U-Net-based DiffusionPolicies [23,26].
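The data flow of Equations (13)–(17) can be traced with a shape-level sketch. It is not the authors' implementation. Every learned component (DiT blocks, ChiResidualBlocks, the auxiliary embeddings, and the final Conv1d) is replaced by a random per-timestep linear map, observation conditioning is omitted, and the channel widths simply follow the subscripts of $E_{d/8}, E_{d/4}, E_{d/2}$. The role of the final width $d$ is not fixed by the equations, so it is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random per-timestep linear map; a placeholder for a learned layer."""
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W

def dit1dlnet_forward_sketch(actions, d=64):
    """Trace tensor shapes through Equations (13)-(17) for actions of shape (H, action_dim)."""
    _, action_dim = actions.shape
    encoder = linear(action_dim, d)                      # stands in for the DiT blocks
    stage1, stage2, stage3 = linear(d, d // 8), linear(d // 8, d // 4), linear(d // 4, d // 2)
    embed8 = linear(action_dim, d // 8)                  # E_{d/8}
    embed4 = linear(action_dim, d // 4)                  # E_{d/4}
    embed2 = linear(action_dim, d // 2)                  # E_{d/2}
    head = linear(d // 2, action_dim)                    # stands in for Conv1d

    h0 = np.tanh(encoder(actions))                       # Eq. (13), conditioning omitted
    h1 = np.tanh(stage1(h0)) + embed8(actions)           # Eq. (14)
    h2 = np.tanh(stage2(h1)) + embed4(actions)           # Eq. (15)
    h3 = np.tanh(stage3(h2)) + embed2(actions)           # Eq. (16)
    return head(h3), (h1, h2, h3)                        # Eq. (17)

noisy_actions = rng.standard_normal((16, 2))             # hypothetical (H, action_dim) chunk
out, (h1, h2, h3) = dit1dlnet_forward_sketch(noisy_actions)
```

The point of the sketch is the wiring: each decoder stage receives an embedding computed directly from the input action sequence, so information about the raw actions reaches every stage without passing through the encoder.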

To address the high computational cost of traditional diffusion policies, DiT1dLnet is designed to minimize structural redundancy. This architectural streamlining is empirically validated in Section 8, where DiT1dLnet demonstrates a significantly lower parameter count and faster inference speed compared to hierarchical CNN-based designs while maintaining superior expressive power. In conventional 1D U-Net architectures, the parameter budget is heavily consumed by repetitive linear projection layers and high-dimensional convolutional kernels across multiple hierarchical downsampling stages. Specifically, each downsampling block in a typical Chi-Unet1d requires independent weight matrices ($W \in \mathbb{R}^{d_{in} \times d_{out}}$) to align feature dimensions at different scales.

By contrast, DiT1dLnet employs a “multi-to-one” integration strategy that consolidates these fragmented computations into a monolithic, high-capacity Transformer backbone. This approach replaces discrete stages of linear transformations with a global attention mechanism, allowing the model to leverage a Global Receptive Field to capture long-horizon temporal dependencies more effectively. Mathematically, the parameter complexity is simplified from a hierarchical summation to a single-level complexity:

$$O\left(\sum_i L_i \cdot d_i^2\right) \to O\left(L \cdot d_{model}^2 + L^2 \cdot d_{model}\right) \tag{18}$$

This simplification helps explain the reduced model size and inference latency observed in our experiments.

3.2. DiT1d Module

The module serves as a core component of our new architecture, implementing a transformer-based block for sequential action token modeling. As illustrated in Figure 3, the module processes action sequences through a carefully designed dual-stage normalization strategy. The DiT1d module used in this work follows the DiT1d architecture illustrated in Figure 1 and is adapted for 1D action sequence modeling. Within the overall DiT1dLnet architecture shown in Figure 2, this module serves as the encoder to capture global temporal dependencies.

A key design choice in this module is the asymmetric FiLM conditioning strategy: the first transformer block applies both scale and shift operations for comprehensive feature modulation, while the second block employs only scaling transformations. This asymmetric design acts as a multiplicative gating mechanism; by omitting the “shift” in the second stage, we prevent the accumulation of additive biases during deep denoising iterations. This allows the model to adaptively weigh feature importance based on the visual context ($\mathrm{Obs}$) without drifting the distribution mean, which is critical for maintaining the high-frequency details of precise action sequences. This design aims to balance model complexity and performance while maintaining computational efficiency.

3.3. ChiResidualBlock

This module is adapted from the CNN-based Diffusion Policy architecture proposed in [12]. The original DiffusionPolicy employs a 1D temporal CNN backbone with Feature-wise Linear Modulation (FiLM [31]) conditioning to model the conditional distribution $p(A_t \mid O_t)$ by conditioning the action generation process on observation features $O_t$. Our block inherits the core design principles from the Diffusion Policy’s CNN architecture, incorporating residual connections with FiLM conditioning at each convolutional layer. As shown in Figure 4, the module applies channel-wise FiLM conditioning (Scale + Shift operations) followed by 1D convolution layers.

However, as noted in the original work, the CNN-based backbone exhibits limitations when dealing with rapidly changing action sequences due to the inductive bias of temporal convolutions that favor low-frequency signals. In the DiT1dLnet framework, we strategically integrate the ChiResidualBlock as a decoder to complement the DiT1d encoder. While the Transformer-based encoder (Section 3.2) captures long-range global temporal dependencies, these CNN-based blocks provide strong local inductive biases and spatial stability during the upsampling process. By combining these two architectures through our proposed triple residual connections, we effectively leverage the “Global-to-Local” synergy, addressing high-frequency limitations while preserving the optimization stability of convolutional layers.

4. Simulation Environment and Dataset

4.1. Robomimic

Robomimic [5] is a comprehensive robotic manipulation benchmark specifically designed to evaluate imitation learning and offline reinforcement learning algorithms. This benchmark encompasses 5 distinct manipulation tasks, each accompanied by proficient human (PH) teleoperated demonstration datasets. The five tasks, shown in Figure 5, are Lift, Can, Tool Hang, Square, and Transport. Additionally, it provides mixed proficient/non-proficient human (MH) demonstration datasets for 4 of the tasks, resulting in 9 task variants in total.

4.2. Push-T

Push-T, originally introduced in the Implicit Behavioral Cloning (IBC) framework [9], is a challenging contact-rich manipulation task. The objective requires pushing a T-shaped block (colored gray) to a predetermined target location (marked in red) using a circular end-effector (colored blue). The task complexity is enhanced through randomized initial conditions for both the T-shaped block and the end-effector positioning. Success in this task demands a sophisticated understanding of contact dynamics, precise force application, and strategic planning to manipulate the object through point contacts. The task serves as an excellent benchmark for evaluating algorithms’ ability to handle non-prehensile manipulation and complex physical interactions.

4.3. Franka Kitchen

Franka Kitchen represents a widely adopted environment for assessing the capabilities of imitation learning and offline reinforcement learning methods in multi-task, long-horizon scenarios.

Originally proposed in the Relay Policy Learning study [22], this environment features a realistic kitchen setup containing 7 interactive objects including a microwave, kettle, light switch, slide cabinet, hinge cabinet, burner, and top burner. The benchmark includes a human demonstration dataset comprising 566 trajectories, where each demonstration involves completing 4 randomly selected tasks in arbitrary sequences. The primary evaluation criterion is the successful execution of as many demonstrated tasks as possible, regardless of their execution order. This setup effectively tests both short-horizon precision control and long-horizon multi-modal behavior synthesis, making it particularly valuable for evaluating policy generalization and task composition capabilities.

5. Evaluation Methodology

We conduct a comprehensive comparative analysis by evaluating the best-performing configurations of each baseline method across all benchmarks. The compared methods include BC-RNN, BET, IBC, DiffusionPolicy, DiffusionBC, and our proposed DiT1dLnet. The computing resources and evaluation metrics used in the experiment are presented in Appendix A.

Following the evaluation protocol commonly adopted in prior work such as DiffusionPolicy [12], we use task-specific metrics rather than a single unified binary metric across all benchmarks. Specifically, success rate is used for Robomimic tasks, target area coverage is used for Push-T, and $p_1$–$p_4$ are reported for Franka Kitchen, where $p_i$ denotes the frequency of completing at least $i$ subtasks within an episode. The experimental results are shown in Table 2. We report two key performance indicators: LAST, the mean performance of the final checkpoint, and MAX, the maximum performance achieved among the last 10 checkpoints. Results are presented in the format (LAST/MAX) to reflect both final performance and peak capability.
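The Franka Kitchen metric $p_i$ described above can be computed directly from per-episode subtask counts. The snippet below is a small illustration; the counts are made-up values chosen only to show the calculation and are not results from the paper, and the function name is hypothetical.

```python
import numpy as np

def kitchen_p_metrics(subtasks_completed, max_i=4):
    """Return p_i = fraction of episodes that completed at least i subtasks."""
    counts = np.asarray(subtasks_completed)
    return {f"p{i}": float(np.mean(counts >= i)) for i in range(1, max_i + 1)}

# Hypothetical subtask counts for 8 evaluation episodes (illustrative only).
metrics = kitchen_p_metrics([4, 3, 4, 1, 0, 2, 4, 4])
# metrics == {"p1": 0.875, "p2": 0.75, "p3": 0.625, "p4": 0.5}
```

By construction $p_1 \geq p_2 \geq p_3 \geq p_4$, so a large gap between $p_1$ and $p_4$ indicates that a policy starts tasks reliably but struggles to chain them over a long horizon.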

To ensure statistical reliability and fair comparison, we adopt a rigorous evaluation protocol that aggregates results from multiple training runs and environmental initializations. Specifically, results are averaged over the last 10 model checkpoints, saved every 100,000 gradient steps during training. This strategy reduces the influence of training variance and provides a more stable estimate of policy performance. Each method is trained with 3 different random seeds to account for initialization sensitivity, and evaluation is conducted over 50 distinct environment initializations, resulting in approximately 1500 individual experiments in total across methods and tasks.
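The LAST and MAX indicators can be sketched as follows, assuming the per-episode scores are stored as a nested list indexed by checkpoint, seed, and episode. This layout and the function name are assumptions made for illustration, not the authors' evaluation code.

```python
from statistics import mean


def last_and_max(scores, window=10):
    """scores[c][s][e] is the score of checkpoint c, seed s, episode e.

    LAST is the mean score of the final checkpoint; MAX is the best
    per-checkpoint mean among the last `window` checkpoints.
    """
    per_checkpoint = [
        mean(ep for seed_runs in ckpt for ep in seed_runs) for ckpt in scores
    ]
    recent = per_checkpoint[-window:]
    return recent[-1], max(recent)


# Toy data: 2 checkpoints x 2 seeds x 2 episodes (illustrative only).
toy = [
    [[1, 0], [1, 1]],  # checkpoint mean 0.75
    [[0, 0], [1, 0]],  # checkpoint mean 0.25
]
print(last_and_max(toy))  # (0.25, 0.75)

# 10 checkpoints x 3 seeds x 50 initializations.
rollouts = 10 * 3 * 50
print(rollouts)  # 1500
```

The product 10 × 3 × 50 reproduces the figure of 1500 quoted above.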

Figure 6. A visualization of the simulated Franka Kitchen environment (the robot is opening the hinge cabinet), which requires the robot to coordinate multiple sub-tasks, such as opening the hinge cabinet, opening the microwave, and moving the kettle, in long-horizon sequences.

6. Experimental Analysis

As discussed in Section 2.2, the compared denoising backbones differ substantially in their ability to capture global temporal dependencies, preserve local geometric details, and maintain optimization stability. The results in Table 1 are broadly consistent with these architectural characteristics and provide empirical support for the structural analysis presented earlier. Overall, DiT1dLnet achieves the best average LAST score (0.910) among all compared methods, indicating stronger final policy quality and more stable convergence across diverse offline imitation learning tasks, while its average MAX score (0.940) remains highly competitive and close to that of DiffusionPolicy (0.946).

From the baseline comparisons, several clear trends can be observed. First, simpler backbones with limited temporal modeling capacity are less competitive on the more challenging tasks, suggesting that complex multi-modal action generation cannot be handled effectively by local or shallow representations alone. Second, transformer-based baselines such as ActionCT and DiffusionBC improve global sequence modeling, but their performance remains inconsistent on tasks that require fine-grained local control. For example, DiffusionBC shows relatively weak performance on square-ph, square-mh, transport-ph, and toolhang-ph, indicating that global temporal modeling alone is insufficient when precise geometric refinement and stable local denoising are also required. Third, the U-Net-based DiffusionPolicy remains one of the strongest baselines across the benchmark, which is consistent with the analysis in Section 2.2 that convolutional encoder–decoder architectures provide strong local inductive bias and stable denoising behavior for manipulation policies.

Against this background, the advantage of DiT1dLnet is particularly evident on tasks such as square-ph, square-mh, and toolhang-ph, where our method clearly outperforms the competing baselines. These tasks in the Robomimic benchmark are especially representative because they involve fine-grained object alignment, contact-sensitive interaction, and strict local geometric constraints, all of which require both temporally coherent action generation and accurate local refinement. A similar tendency can also be observed in Push-T, which emphasizes contact-rich non-prehensile manipulation, and in relay-kitchen, which evaluates long-horizon multi-stage task composition. The strong performance of DiT1dLnet on these benchmarks is in line with the design motivation introduced in Section 2.2: the DiT-based encoder is better suited to modeling long-range temporal dependencies across action sequences, while the CNN-based decoder preserves local refinement capability during denoising. In this sense, Table 1 further suggests that combining transformer-based global modeling with CNN-based local refinement is more effective than relying on either mechanism alone.

At the same time, Table 1 also reveals an important limitation of the proposed method. On transport-ph, DiT1dLnet does not outperform DiffusionPolicy, which achieves 0.842/0.880 compared with 0.443/0.611 for our method. This task is more challenging than the single-arm manipulation settings because it requires coordinated bimanual interaction, precise object handover, and tight cross-limb synchronization in a higher-dimensional action space. Under such conditions, the stronger local inductive bias of the U-Net-based baseline may remain more advantageous for maintaining strict local geometric consistency during two-arm coordination. Therefore, the results in Table 1 not only demonstrate the overall competitiveness of DiT1dLnet but also highlight the task conditions under which its architectural strengths are most apparent, as well as the scenarios that require further improvement. These observations are further discussed in Section 7 from the perspective of architectural efficiency and task-specific failure modes.

7. Ablation Study

We first evaluate the importance of combining global temporal modeling and local refinement by comparing three variants (as shown in Table 3): the full hybrid model, a DiT-only model, and a variant with a simplified decoder (denoted as Only CNN1d).

7.1. Effect of Hybrid Architecture

Only DiT: The DiT-only variant performs well on relatively simple tasks but degrades on more challenging manipulation settings such as toolhang-ph and transport-ph, suggesting that global temporal modeling alone is insufficient for precise local refinement.

This performance drop highlights the limitation of relying exclusively on global temporal modeling. Although the DiT block is effective at capturing long-range dependencies, it lacks the ability to perform fine-grained local refinement. Tasks such as toolhang-ph require precise geometric alignment and contact-sensitive control, where small local errors can lead to complete failure. Without a dedicated refinement mechanism, the model struggles to reconstruct high-frequency action details during the denoising process.

Only CNN1d: The Only CNN1d variant retains only the decoder-side structure while removing the DiT-based encoder. In practice, this configuration corresponds to an incomplete decoder structure without the full encoder–decoder interaction and multi-scale feature propagation found in the original design. As a result, its effective modeling behavior is significantly weakened.

As shown in Table 3, this variant achieves reasonable performance on simple tasks but performs poorly on complex scenarios such as transport-ph and toolhang-ph, where success rates are among the lowest across all variants.

This degradation can be attributed to two main factors. First, the absence of DiT blocks removes the model’s ability to capture long-range temporal dependencies, which are essential for coordinating multi-step actions and bimanual interactions. Second, relying only on a few shallow layers that behave almost like fully connected mappings (similar to the structure discussed in Section 2.2.1) is insufficient for capturing the temporal dependencies and structured patterns of action sequences.

Full Model: The full DiT1dLnet architecture consistently achieves the best performance across all tasks. This result demonstrates that the hybrid design effectively combines the strengths of both components: the DiT blocks provide global temporal modeling, while the CNN1d-based decoder contributes local refinement capability. It should be clarified that the latter conclusion is not derived solely from the ablation results in Table 3, but is instead supported jointly by the structural analysis of CNN/U-Net architectures in Section 2.2.4 and the experimental results of DiffusionPolicy in Table 1, where such a backbone demonstrates strong local inductive bias and fine-grained information preservation during denoising.

7.2. Effect of FiLM Conditioning (Shift Component)

We further investigate the role of the FiLM conditioning mechanism by comparing the default scale-only setting with the variant with FiLM shift. As shown in Table 3, introducing the shift component generally leads to lower success rates, indicating that adding shift does not improve performance in our architecture and may even reduce model stability in manipulation tasks.

A possible explanation lies in the asymmetric FiLM design introduced in Section 3.2. In our design, the first stage uses both scale and shift, whereas the second stage keeps only scale in order to improve feature stability during iterative denoising. The shift term introduces additive modulation, which changes the mean of intermediate feature distributions. Although such additive modulation can, in principle, increase representational flexibility, it may also accumulate bias across denoising steps and disturb the stability of the generated action sequence.

By contrast, scale-only FiLM acts as a multiplicative gating mechanism, which is sufficient to modulate features according to the observation condition while avoiding unnecessary feature drift. This may help preserve high-frequency action details and maintain consistent geometric structure during denoising. For tasks requiring precise local alignment and stable contact dynamics, such as square-ph and toolhang-ph, this more stable conditioning strategy is particularly beneficial.

These results suggest that the shift component in FiLM is not necessary in our architecture. Instead, scale-only modulation provides a better trade-off between conditional expressiveness and denoising stability, which explains why the variant with FiLM shift shows reduced performance compared with the default setting.
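The difference between scale-only and scale-plus-shift modulation can be illustrated with a small sketch. It assumes the plain FiLM form $\gamma \cdot h + \beta$ applied per channel, with fixed toy coefficients in place of values predicted from the observation condition. The exact parameterization used in DiT1dLnet is not reproduced here, and all names and numbers are illustrative.

```python
def film(features, scale, shift=None):
    """Feature-wise linear modulation applied per channel.

    features: list of channels, each a list of values over time.
    scale, shift: one coefficient per channel (shift=None means scale-only).
    """
    if shift is None:
        shift = [0.0] * len(features)
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, scale, shift)
    ]


def channel_means(features):
    """Mean of each channel over the time axis."""
    return [sum(ch) / len(ch) for ch in features]


# Toy zero-mean features: 2 channels x 4 time steps (illustrative values).
h = [[-1.0, 1.0, -2.0, 2.0], [0.5, -0.5, 1.5, -1.5]]
gamma, beta = [2.0, 0.5], [0.3, -0.2]

scale_only = film(h, gamma)
scale_shift = film(h, gamma, beta)
print(channel_means(scale_only))   # zero-mean channels stay zero-mean
print(channel_means(scale_shift))  # channel means move to the shift values
```

In this toy setting the multiplicative term rescales each channel without moving its mean, whereas the additive term offsets every channel by its shift coefficient, which is the kind of mean change the discussion above associates with feature drift.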

7.3. Effect of Residual Connections

We analyze the impact of residual connections by comparing three configurations: no residual connections, single residual connections, and the proposed triple residual connections.

No Residual Connections: The variant without residual connections shows a significant drop in performance across multiple tasks, particularly in transport-ph and toolhang-ph. These results suggest that residual pathways are essential for maintaining effective information flow in the network.

Without residual connections, feature representations must be propagated solely through the main network path, which leads to information loss and weaker gradient flow. In complex manipulation tasks, where both global structure and local details are important, this limitation becomes especially severe.

Single Residual Connection: Introducing a single residual connection improves performance compared to the no-residual variant, indicating that shortcut pathways help mitigate information degradation. However, this improvement is still insufficient for complex tasks.

While single residual connections partially preserve feature information, they operate at a single scale and cannot fully support the multi-scale nature of the denoising process. As a result, performance gaps remain in tasks such as square-ph, transport-ph, and toolhang-ph.

Triple Residual Connections (Full Model): The proposed triple residual connection design, which corresponds to the full model, achieves the best performance across all evaluated tasks. Unlike conventional residual designs, this mechanism injects action embeddings at multiple scales into different stages of the decoder.

This multi-scale residual structure plays a crucial role in preserving both low-level spatial details and high-level semantic information throughout the denoising process. It enhances feature alignment across layers, improves gradient propagation, and allows the network to reconstruct complex action sequences more effectively.

The performance gains are particularly significant in high-dimensional and precision-demanding tasks, demonstrating that multi-scale information flow is essential for robust robot manipulation. Therefore, the superiority of the full model can be attributed in part to the effectiveness of the proposed triple residual design.

Overall, the ablation results, measured by the MAX metric (the maximum performance over the last 10 checkpoints, with each averaged over 3 seeds and 50 episodes), demonstrate that the performance of DiT1dLnet arises from the combined effect of multiple complementary components. The DiT blocks provide strong global temporal modeling, the decoder enables local refinement, the FiLM conditioning mechanism regulates observation-guided modulation, and the triple residual connections ensure stable and efficient information flow.

These findings confirm that no single component alone is sufficient to achieve optimal performance. Instead, the synergy between global modeling, local refinement, and multi-scale feature preservation is critical for solving complex robotic manipulation tasks.

8. Key Findings

Our results across Tables 1–4 show that DiT1dLnet achieves competitive overall performance while maintaining a more efficient model structure than representative diffusion-policy baselines. In the main benchmark results, the proposed method performs strongly on a range of robotic manipulation tasks, particularly those involving fine-grained object interaction and long-horizon decision making. The ablation study further suggests that the performance of DiT1dLnet arises from the combination of multiple components rather than from any single design choice alone. In particular, the hybrid architecture, FiLM-based conditioning strategy, and multi-scale residual connections together contribute to stable denoising and effective action generation. To improve experimental transparency, all training processes and performance curves were recorded using Weights & Biases.

In addition to task performance, Table 4 shows that DiT1dLnet reduces model size and inference time compared with the chi_UNet1d-based DiffusionPolicy baseline, indicating that the proposed architecture offers a favorable balance between expressiveness and efficiency. At the same time, the results also reveal a limitation on the transport task, where the U-Net-based baseline remains stronger. This suggests that although the proposed hybrid design is effective for many manipulation settings, further improvement is still needed for highly coordinated bimanual interaction and stricter local geometric constraints. Overall, these findings are consistent with the main argument of this work: combining transformer-based global temporal modeling with CNN-based local refinement is a promising direction for efficient diffusion-based robot policy learning.
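The size and speed comparison can be made concrete with a short calculation on the Table 4 entries. The values are copied from that table, and because the table does not state units, only the ratios are computed here.

```python
# Values copied from Table 4 (low-dim Lift-ph); units are as reported there.
table4 = {
    "DiffusionPolicy-ChiUnet1d": {"size": 68.913, "time": 0.405},
    "DiT1dLnet": {"size": 13.482, "time": 0.203},
}

base = table4["DiffusionPolicy-ChiUnet1d"]
ours = table4["DiT1dLnet"]

size_ratio = base["size"] / ours["size"]  # how many times smaller the model is
speedup = base["time"] / ours["time"]     # how many times faster inference is

print(f"model size ratio: {size_ratio:.2f}x")  # model size ratio: 5.11x
print(f"inference speedup: {speedup:.2f}x")    # inference speedup: 2.00x
```

On this task, DiT1dLnet is therefore roughly five times smaller and about twice as fast at inference as the chi_UNet1d-based baseline.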

9. Discussion

Although DiT1dLnet achieves strong performance across multiple robotic manipulation benchmarks, several directions remain for future research. First, the current study is limited to simulation environments, and extending the proposed framework to real-world robotic platforms would be important for evaluating robustness under perception noise and sim-to-real gaps. Second, further improving sampling efficiency remains valuable, since diffusion-based policies still rely on iterative denoising during inference. Third, the relatively weaker performance on challenging bimanual tasks suggests that future work should enhance cross-limb coordination modeling and local geometric reasoning. Finally, combining the proposed architecture with offline reinforcement learning, online fine-tuning, or richer multi-modal sensory conditioning may further improve adaptability, scalability, and generalization.

10. Conclusions

In this work, we proposed DiT1dLnet, a hybrid diffusion architecture for robot imitation learning that combines transformer-based global temporal modeling with CNN-based local refinement. The model is designed to improve both action generation quality and computational efficiency in robotic manipulation tasks.

Experiments on Robomimic, Push-T, and Franka Kitchen show that DiT1dLnet achieves competitive or superior performance on most evaluated tasks, while also reducing model size and inference cost relative to representative diffusion-policy baselines. The ablation results further suggest that the effectiveness of the proposed method comes from the combined contribution of the DiT-based encoder, the CNN-based decoder, FiLM conditioning, and the multi-scale residual design.

Overall, these results indicate that hybrid global-local denoising architectures are a promising direction for efficient and accurate robot policy learning. At the same time, the weaker performance on challenging bimanual tasks suggests that further improvements are still needed in coordination modeling and local geometric reasoning. These properties make the proposed architecture a useful candidate for robotic manipulation settings that require both precise control and efficient inference.

Author Contributions

Conceptualization, J.L. and W.H.; methodology, J.L.; software, J.L.; validation, J.L., W.H. and F.C.; formal analysis, Q.Y.; investigation, Q.Y.; resources, J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, F.C.; visualization, W.H.; supervision, F.C.; project administration, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This study utilizes publicly accessible anonymized datasets, including robomimic (https://robomimic.github.io, accessed on 11 April 2024), Push-T/Diffusion Policy (https://diffusion-policy.cs.columbia.edu, accessed on 14 March 2024), Franka Kitchen/Relay-Kitchen (https://relay-policy-learning.github.io/, accessed on 11 January 2024). These datasets are publicly available for academic use and were utilized in full compliance with their respective licenses.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Experimental Details

Appendix A.1. Computing Resources

RL experiments are conducted on a server equipped with an Intel(R) Xeon Gold 6426Y CPU and one NVIDIA GeForce RTX A6000 GPU. To ensure consistency, IL experiments are performed on the identical server setup.

Appendix A.2. Evaluation Metrics

In the D4RL benchmark, scores are normalized to the range between 0 and 100 using the expert-normalized score $100 \times \frac{\text{score} - \text{random\_score}}{\text{expert\_score} - \text{random\_score}}$. For the IL benchmarks, we report target area coverage as the score in the PushT benchmark and success rate in the Robomimic benchmark. In the Relay Kitchen environment, since most human demonstrations can only complete 4 subtasks, we denote the success rate of completing the $i$-th subtask as $p_{i}$ and report the average success rate as $\text{score} = (p_{1} + p_{2} + p_{3} + p_{4})/4$.
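Both scores are simple enough to state in code. The sketch below assumes the standard D4RL normalization, in which a random policy maps to 0 and an expert policy maps to 100, together with the four-subtask average described above. The function names and sample inputs are illustrative.

```python
def d4rl_normalized_score(score, random_score, expert_score):
    """Expert-normalized score: 0 for a random policy, 100 for an expert."""
    if expert_score == random_score:
        raise ValueError("expert_score and random_score must differ")
    return 100.0 * (score - random_score) / (expert_score - random_score)


def kitchen_score(p):
    """Average of the per-subtask success rates p_1, ..., p_n."""
    if not p:
        raise ValueError("need at least one success rate")
    return sum(p) / len(p)


# Hypothetical raw returns (illustrative only).
print(d4rl_normalized_score(60.0, 10.0, 110.0))  # 50.0
print(kitchen_score([1.0, 1.0, 0.5, 0.5]))       # 0.75
```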

Unless stated otherwise, we use the default hyperparameters from the official implementations for most algorithms and datasets. Configuration files and hyperparameters for each algorithm and environment are available in YAML format on our GitHub repository for reproducibility. Key hyperparameters for each offline IL algorithm are presented in Table A1 and Table A2. We thank Gymnasium [36] for providing the interactive experimental environment, which offered an intuitive and efficient platform for algorithm validation and analysis, and Cleandiffuser [37] for providing the diffusion training framework, which laid a solid foundation for model implementation and parameter optimization in this work.

Table A1. Hyperparameters for DiffusionPolicy, DiffusionBC, and DiT1dLnet in Low-Dim Tasks.

Hyperparameters	DiffusionPolicy (chi_UNet1d)	DiffusionPolicy (chi_Transformer)	DiffusionBC	DiT1dLnet
Architecture	chi_UNet1d	chi_Transformer	DiT	DiT + Unet
Diffusion Model	DDPM	DDPM	DDPM	DDPM
Sampling Steps	5 (PushT), 50 (otherwise)	5 (PushT), 50 (otherwise)	50	5 (PushT), 50 (otherwise)
Horizon	16	10	2	16
Obs Steps	2	2	2	2
Action Steps	8	8	1	8
Gradient Steps	10⁶	10⁶	10⁶	10⁶
Batch Size	256 (state based)	256 (state based)	512 (state based)	256 (state based)
Temperature	1.0	1.0	1.0	1.0
Learning Rate	10⁻⁴	10⁻⁴	10⁻³	10⁻⁴
Extra Sample Steps	N/A	N/A	8	N/A
Control Mode	Pos	Pos	Vel	Pos
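To show how the horizon, observation, and action settings in Table A1 interact, the sketch below mirrors the DiT1dLnet column as a small Python configuration object. The class name, field names, and the slicing rule (execute `action_steps` actions starting at index `obs_steps - 1` of the predicted horizon, as is common in receding-horizon diffusion policies) are assumptions for illustration and are not taken from the authors' YAML files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LowDimConfig:
    """Hypothetical container for the DiT1dLnet column of Table A1."""
    horizon: int = 16            # length of the predicted action sequence
    obs_steps: int = 2           # number of past observations used as condition
    action_steps: int = 8        # number of predicted actions that are executed
    sampling_steps: int = 50     # DDPM denoising steps (5 for PushT)
    gradient_steps: int = 1_000_000
    batch_size: int = 256
    learning_rate: float = 1e-4

    def executed_slice(self):
        """Assumed convention: execute actions aligned with the latest observation."""
        start = self.obs_steps - 1
        return slice(start, start + self.action_steps)


cfg = LowDimConfig()
pusht_cfg = LowDimConfig(sampling_steps=5)

plan = list(range(cfg.horizon))     # stand-in for 16 predicted actions
print(plan[cfg.executed_slice()])   # [1, 2, 3, 4, 5, 6, 7, 8]
print(pusht_cfg.sampling_steps)     # 5
```

Under this convention, only half of each 16-step prediction is executed before the policy re-plans from fresh observations.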

Table A2. Hyperparameters for ACT in Low-Dim Tasks.

Hyperparameters	ACT
Architecture	Transformer-based
Learning Rate	10⁻⁵
Batch Size	256 (Low dim)
Encoder Layers	4
Decoder Layers	7
Feedforward Dimension	256
Hidden Dimension	256
Heads	8
Chunk size	16
Beta	10
Gradient Steps	10⁶
Control Mode	Vel (Kitchen)/Pos (Otherwise)

References

1. Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 627–635.
2. Pomerleau, D.A. Efficient training of artificial neural networks for autonomous navigation. Neural Comput. 1991, 3, 88–97.
3. Torabi, F.; Warnell, G.; Stone, P. Behavioral cloning from observation. arXiv 2018, arXiv:1805.01954.
4. Pearce, T.; Zhu, J. Counter-strike deathmatch with large-scale behavioural cloning. In 2022 IEEE Conference on Games (CoG); IEEE: Piscataway, NJ, USA, 2022; pp. 104–111.
5. Mandlekar, A.; Xu, D.; Wong, J.; Nasiriany, S.; Wang, C.; Kulkarni, R.; Fei-Fei, L.; Savarese, S.; Zhu, Y.; Martín-Martín, R. What matters in learning from offline human demonstrations for robot manipulation. arXiv 2021, arXiv:2108.03298.
6. Hawke, J.; Shen, R.; Gurau, C.; Sharma, S.; Reda, D.; Nikolov, N.; Mazur, P.; Micklethwaite, S.; Griffiths, N.; Shah, A.; et al. Urban driving with conditional imitation learning. In 2020 IEEE International Conference on Robotics and Automation (ICRA); IEEE: Piscataway, NJ, USA, 2020.
7. Chen, H.; Lu, C.; Ying, C.; Su, H.; Zhu, J. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv 2022, arXiv:2209.14548.
8. Shafiullah, N.M.; Cui, Z.; Altanzaya, A.A.; Pinto, L. Behavior transformers: Cloning k modes with one stone. Adv. Neural Inf. Process. Syst. 2022, 35, 22955–22968.
9. Florence, P.; Lynch, C.; Zeng, A.; Ramirez, O.A.; Wahid, A.; Downs, L.; Wong, A.; Lee, J.; Mordatch, I.; Tompson, J. Implicit behavioral cloning. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: New York, NY, USA, 2022; pp. 158–168.
10. Wu, J.; Sun, X.; Zeng, A.; Song, S.; Lee, J.; Rusinkiewicz, S.; Funkhouser, T. Spatial action maps for mobile manipulation. arXiv 2020, arXiv:2004.09141.
11. Orsini, M.; Raichuk, A.; Hussenot, L.; Vincent, D.; Dadashi, R.; Girgin, S.; Geist, M.; Bachem, O.; Pietquin, O.; Andrychowicz, M. What matters for adversarial imitation learning? Adv. Neural Inf. Process. Syst. 2021, 34, 14656–14668.
12. Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robot. Res. 2025, 44, 1684–1704.
13. Pearce, T.; Rashid, T.; Kanervisto, A.; Bignell, D.; Sun, M.; Georgescu, R.; Macua, S.V.; Tan, S.Z.; Momennejad, I.; Hofmann, K.; et al. Imitating human behaviour with diffusion models. arXiv 2023, arXiv:2301.10677.
14. Dong, Z.; Hao, J.; Yuan, Y.; Ni, F.; Wang, Y.; Li, P.; Zheng, Y. Diffuserlite: Towards real-time diffusion planning. Adv. Neural Inf. Process. Syst. 2024, 37, 122556–122583.
15. Lu, H.; Han, D.; Shen, Y.; Li, D. What makes a good diffusion planner for decision making? arXiv 2025, arXiv:2503.00535.
16. Lu, C.; Chen, H.; Chen, J.; Su, H.; Li, C.; Zhu, J. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: New York, NY, USA, 2023; pp. 22825–22855.
17. Ajay, A.; Du, Y.; Gupta, A.; Tenenbaum, J.; Jaakkola, T.; Agrawal, P. Is conditional generative modeling all you need for decision-making? arXiv 2022, arXiv:2211.15657.
18. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
19. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; PMLR: New York, NY, USA, 2016; pp. 1928–1937.
20. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
21. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4195–4205.
22. Gupta, A.; Kumar, V.; Lynch, C.; Levine, S.; Hausman, K. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv 2019, arXiv:1910.11956.
23. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
24. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456.
25. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. arXiv 2022, arXiv:2010.02502.
26. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8162–8171.
27. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 10684–10695.
28. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598.
29. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
30. Dong, Z.; Yuan, Y.; Hao, J.; Ni, F.; Mu, Y.; Zheng, Y.; Hu, Y.; Lv, T.; Fan, C.; Hu, Z. Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model. arXiv 2023, arXiv:2310.02054.
31. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. Film: Visual reasoning with a general conditioning layer. Proc. AAAI Conf. Artif. Intell. 2018, 32, 3942–3951.
32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
33. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592.
34. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
35. Janner, M.; Du, Y.; Tenenbaum, J.B.; Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv 2022, arXiv:2205.09991.
36. Towers, M.; Kwiatkowski, A.; Balis, J.; De Cola, G.; Deleu, T.; Goulão, M.; Andreas, K.; Krimmel, M.; KG, A.; Perez-Vicente, R.; et al. Gymnasium: A standard interface for reinforcement learning environments. Adv. Neural Inf. Process. Syst. 2026, 38.
37. Dong, Z.; Yuan, Y.; Hao, J.; Ni, F.; Ma, Y.; Li, P.; Zheng, Y. Cleandiffuser: An easy-to-use modularized library for diffusion models in decision making. Adv. Neural Inf. Process. Syst. 2024, 37, 86899–86926.

Figure 1. Visualization of the network architectures implemented in different diffusion models. The figure is intended to highlight the structural differences among representative backbone designs. These architectures differ in how they process noisy action inputs, incorporate observation conditioning, and model temporal dependencies in action sequences. The detailed formulation of these architectures is provided in Section 2.2, and the corresponding experimental comparison results are reported in Table 1.

Figure 2. Novel network architecture in the diffusion model: DiT1dLnet overview. General formulation: at time step $t$, the policy takes the latest $T_{o}$ steps of observation data $O_{t}$ as input and outputs $T_{a}$ steps of actions $A_{t}$. In our network architecture, FiLM (Feature-wise Linear Modulation) [31] conditioning of the observation feature $O_{t}$ is applied to every convolution layer, channel-wise. Starting from $A_{t}^{K}$ drawn from Gaussian noise, the output of the noise prediction network $\varepsilon_{\theta}$ is subtracted, repeating $K$ times to get $A_{t}^{0}$, the denoised action sequence.

Figure 3. Visualization of implemented DiTBlock. The module implements an asymmetric FiLM conditioning strategy where the first block applies both scale and shift for comprehensive modulation, while the second block utilizes only scaling to maintain feature stability.

Figure 4. Structure of the ChiResidualBlock (adapted from [12] and integrated into our hybrid decoder). This 1D-CNN-based block is integrated into our decoder to leverage its stability in processing temporal features while maintaining consistent gradient flow through residual connections.

Figure 5. Visualization of the six robotic manipulation tasks used in our evaluation: Lift, Can, Tool Hang, Square, Transport, and Push-T. These tasks cover a range of challenges including long-horizon sequences and precise object interaction.

Table 1. Evaluation results on the offline IL benchmarks. The metrics are the success rate for Robomimic and Relay-Kitchen and the target area coverage for PushT. We report the mean performance of the last checkpoint, denoted as LAST, and the maximum performance of the last 10 checkpoints (3 for image tasks), denoted as MAX, with each averaged over 3 seeds and 50 episodes. Results are shown as LAST/MAX.

Task Name	BC-RNN	ActionCT	DiffusionPolicy	DiffusionBC	DiT1dLnet (Ours)
low dim (state based)
pusht	0.591/0.700	0.990/1.000	0.994/1.000	0.990/0.990	1.000/1.000
relay-kitchen	0.750/0.790	0.724/0.761	0.990/1.000	0.811/0.892	1.000/1.000
lift-ph	0.963/1.000	0.983/1.000	1.000/1.000	0.990/1.000	1.000/1.000
lift-mh	0.933/1.000	0.981/1.000	1.000/1.000	0.921/1.000	1.000/1.000
can-ph	0.910/1.000	0.924/0.983	0.990/1.000	0.910/1.000	0.982/1.000
can-mh	0.811/1.000	0.811/1.000	0.992/1.000	0.772/0.885	0.990/1.000
square-ph	0.730/0.950	0.806/0.902	0.700/0.905	0.663/0.761	0.945/0.983
square-mh	0.598/0.864	0.463/0.724	0.623/0.811	0.427/0.520	0.835/0.903
transport-ph	0.473/0.761	0.642/0.851	0.842/0.880	0.172/0.341	0.443/0.611
toolhang-ph	0.315/0.677	0.642/0.820	0.724/0.864	0.153/0.365	0.806/0.902
Average	0.777/0.874	0.797/0.901	0.886/0.946	0.681/0.775	0.910/0.940
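The LAST and MAX statistics defined in the Table 1 caption can be computed as in the sketch below. The caption does not state the order of aggregation, so this example assumes scores are first averaged across seeds at each checkpoint and the maximum is then taken over the final window; the function name `last_and_max` and the toy curves are hypothetical.

```python
from statistics import mean


def last_and_max(curves, window=10):
    """Return (LAST, MAX) from per-seed checkpoint curves.

    curves: one list of checkpoint scores per seed, all of equal length.
    Each score is assumed to already be a mean over evaluation episodes.
    """
    # Average across seeds at each checkpoint (assumed aggregation order).
    per_checkpoint = [mean(scores) for scores in zip(*curves)]
    last = per_checkpoint[-1]                  # final checkpoint
    best = max(per_checkpoint[-window:])       # best within the final window
    return last, best


# Toy data: 3 seeds x 4 checkpoints.
curves = [
    [0.50, 0.75, 1.00, 0.50],
    [0.50, 0.75, 0.75, 0.50],
    [0.50, 0.75, 0.50, 0.50],
]
last, best = last_and_max(curves, window=3)
```

Taking the per-seed maximum before averaging would be an equally plausible reading and generally yields a value at least as high as the one computed here.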

Table 2. Multi-stage task performance. $p_{x}$ is the frequency of interacting with x or more objects (e.g., hinge cabinet as shown in Figure 6). Our DiT1dLnet matches or exceeds the diffusion policy on all metrics, with the gain appearing on the most difficult metric, $p_{4}$.

Method	$p_{1}$	$p_{2}$	$p_{3}$	$p_{4}$
BC-RNN	1.000	0.903	0.741	0.343
IBC	0.990	0.871	0.619	0.242
BET	0.990	0.933	0.715	0.447
DiffusionPolicy	1.000	1.000	1.000	0.990
DiffusionBC	1.000	0.983	0.761	0.552
DiT1dLnet (ours)	1.000	1.000	1.000	1.000
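As a small illustration of the $p_{x}$ metric from the Table 2 caption, the sketch below counts the fraction of episodes in which at least x objects were interacted with. The function name `multi_stage_frequencies` and the per-episode counts are hypothetical and are not taken from the paper's evaluation logs.

```python
def multi_stage_frequencies(objects_per_episode, max_stage=4):
    """Return {x: p_x}, where p_x is the fraction of episodes
    that interacted with x or more objects."""
    n = len(objects_per_episode)
    return {
        x: sum(1 for count in objects_per_episode if count >= x) / n
        for x in range(1, max_stage + 1)
    }


# Hypothetical number of objects interacted with in each of 8 episodes.
counts = [4, 4, 3, 2, 4, 1, 4, 3]
p = multi_stage_frequencies(counts)
```

Because an episode that reaches x objects also reaches every smaller count, $p_{x}$ can never increase with x, which matches the monotone rows in the table.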

Table 3. DiT1dLnet ablation study. We report the max performance over the last 10 checkpoints (MAX), averaged over 3 seeds and 50 episodes. Metrics are task-specific: success rate for Robomimic tasks and target area coverage for Push-T. ✓ and × indicate that a module is used or removed, respectively.

Variant	DiT Block	CNN1d Decoder	FiLM	Residual Connections	Pusht	Kitchen	Lift-ph	Can-ph	Square-ph	Transport-ph	Toolhang-ph
Full model	✓	✓	✓	3	1.000	1.000	1.000	1.000	0.983	0.783	0.902
Only DiT	✓	×	✓	×	1.000	1.000	1.000	1.000	0.968	0.643	0.583
Only CNN1d	×	✓	✓	×	0.990	0.994	1.000	1.000	0.764	0.342	0.361
with FiLM shift	✓	✓	✓	3	1.000	0.998	1.000	1.000	0.958	0.744	0.873
Single Residual	✓	✓	✓	1	1.000	0.963	1.000	1.000	0.832	0.391	0.422
No residual	✓	✓	✓	×	1.000	0.712	1.000	1.000	0.827	0.327	0.371

Table 4. Model size and inference time on the low-dim Lift-ph task.

Algo	Model Size	Inference Time
DiffusionPolicy-ChiUnet1d	68.913	0.405
DiffusionBC-PearceMLP	0.834	0.062
DiT1dLnet (ours)	13.482	0.203
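The relative differences implied by Table 4 can be worked out with simple arithmetic, as sketched below. The values are copied from the table; since the table does not state units, only unit-free ratios are computed. The helper name `relative_to` and the dictionary layout are illustrative assumptions.

```python
# Values copied from Table 4; units are as reported there, ratios are unit-free.
table4 = {
    "DiffusionPolicy-ChiUnet1d": {"size": 68.913, "time": 0.405},
    "DiffusionBC-PearceMLP": {"size": 0.834, "time": 0.062},
    "DiT1dLnet (ours)": {"size": 13.482, "time": 0.203},
}


def relative_to(baseline, ours="DiT1dLnet (ours)"):
    """Compare a baseline row against the DiT1dLnet row."""
    b, o = table4[baseline], table4[ours]
    return {
        "size_ratio": b["size"] / o["size"],  # > 1 means the baseline is larger
        "speedup": b["time"] / o["time"],     # > 1 means DiT1dLnet is faster
    }


vs_unet = relative_to("DiffusionPolicy-ChiUnet1d")
vs_mlp = relative_to("DiffusionBC-PearceMLP")
```

Under these numbers, DiT1dLnet takes roughly half the inference time of the U-Net baseline at about one fifth of its size, while the MLP baseline remains both smaller and faster.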
