DiT1dLnet: A Fast and Accurate Diffusion Model Structure Based on Robot Behavior Imitation
Abstract
1. Introduction
2. Related Work
2.1. Diffusion Model
2.2. Representative Denoising Architectures
2.2.1. MLP Sieve
2.2.2. Transformer-Encoder Only
2.2.3. Time-Series Diffusion Transformer
2.2.4. U-Net
2.2.5. DiT1d
3. Key Design
3.1. DiT1dLnet
3.2. DiT1d Module
3.3. ChiResidualBlock
4. Simulation Environment and Dataset
4.1. Robomimic
4.2. Push-T
4.3. Franka Kitchen
5. Evaluation Methodology

6. Experimental Analysis
7. Ablation Study
7.1. Effect of Hybrid Architecture
7.2. Effect of FiLM Conditioning (Shift Component)
7.3. Effect of Residual Connections
8. Key Findings
9. Discussion
10. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Experimental Details
Appendix A.1. Computing Resources
Appendix A.2. Evaluation Metrics
| Hyperparameters | DiffusionPolicy | DiffusionPolicy | DiffusionBC | DiT1dLnet |
|---|---|---|---|---|
| Architecture | chi_UNet1d | chi_Transformer | DiT | DiT + Unet |
| Diffusion Model | DDPM | DDPM | DDPM | DDPM |
| Sampling Steps | 5 (PushT), 50 (otherwise) | 5 (PushT), 50 (otherwise) | 50 | 5 (PushT), 50 (otherwise) |
| Horizon | 16 | 10 | 2 | 16 |
| Obs Steps | 2 | 2 | 2 | 2 |
| Action Steps | 8 | 8 | 1 | 8 |
| Gradient Steps | $10^{6}$ | $10^{6}$ | $10^{6}$ | $10^{6}$ |
| Batch Size | 256 (state based) | 256 (state based) | 512 (state based) | 256 (state based) |
| Temperature | 1.0 | 1.0 | 1.0 | 1.0 |
| Learning Rate | $10^{-4}$ | $10^{-4}$ | $10^{-3}$ | $10^{-4}$ |
| Extra Sample Steps | N/A | N/A | 8 | N/A |
| Control Mode | Pos | Pos | Vel | Pos |
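
The settings in the column whose architecture is DiT + Unet can be collected into a small configuration object. The sketch below is illustrative only: the class name `DiT1dLnetConfig`, its field names, and the `sampling_steps` helper are hypothetical and do not come from the authors' code. The gradient-step count is taken to be 10^6 and the learning rate 10^-4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiT1dLnetConfig:
    """Hypothetical container; field names are illustrative, values follow the table."""

    architecture: str = "DiT + Unet"
    diffusion_model: str = "DDPM"
    horizon: int = 16
    obs_steps: int = 2
    action_steps: int = 8
    gradient_steps: int = 10**6   # assumed reading of the table entry
    batch_size: int = 256         # state-based inputs
    temperature: float = 1.0
    learning_rate: float = 1e-4   # assumed reading of the table entry
    control_mode: str = "pos"

    def sampling_steps(self, task: str) -> int:
        """Return 5 denoising steps for Push-T and 50 for every other task."""
        return 5 if task.lower() in {"pusht", "push-t"} else 50


config = DiT1dLnetConfig()
print(config.sampling_steps("pusht"), config.sampling_steps("square-ph"))
```

Keeping the task-dependent sampling-step rule in one method avoids scattering the Push-T special case across evaluation scripts.
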
| Hyperparameters | ACT |
|---|---|
| Architecture | Transformer-based |
| Learning Rate | $10^{-5}$ |
| Batch Size | 256 (Low dim) |
| Encoder Layers | 4 |
| Decoder Layers | 7 |
| Feedforward Dimension | 256 |
| Hidden Dimension | 256 |
| Heads | 8 |
| Chunk size | 16 |
| Beta | 10 |
| Gradient Steps | $10^{6}$ |
| Control Mode | Vel (Kitchen)/Pos (Otherwise) |





| Task Name | BC-RNN | ActionCT | DiffusionPolicy | DiffusionBC | DiT1dLnet (Ours) |
|---|---|---|---|---|---|
| low dim (state based) | |||||
| pusht | 0.591/0.700 | 0.990/1.000 | 0.994/1.000 | 0.990/0.990 | 1.000/1.000 |
| relay-kitchen | 0.750/0.790 | 0.724/0.761 | 0.990/1.000 | 0.811/0.892 | 1.000/1.000 |
| lift-ph | 0.963/1.000 | 0.983/1.000 | 1.000/1.000 | 0.990/1.000 | 1.000/1.000 |
| lift-mh | 0.933/1.000 | 0.981/1.000 | 1.000/1.000 | 0.921/1.000 | 1.000/1.000 |
| can-ph | 0.910/1.000 | 0.924/0.983 | 0.990/1.000 | 0.910/1.000 | 0.982/1.000 |
| can-mh | 0.811/1.000 | 0.811/1.000 | 0.992/1.000 | 0.772/0.885 | 0.990/1.000 |
| square-ph | 0.730/0.950 | 0.806/0.902 | 0.700/0.905 | 0.663/0.761 | 0.945/0.983 |
| square-mh | 0.598/0.864 | 0.463/0.724 | 0.623/0.811 | 0.427/0.520 | 0.835/0.903 |
| transport-ph | 0.473/0.761 | 0.642/0.851 | 0.842/0.880 | 0.172/0.341 | 0.443/0.611 |
| toolhang-ph | 0.315/0.677 | 0.642/0.820 | 0.724/0.864 | 0.153/0.365 | 0.806/0.902 |
| Average | 0.777/0.874 | 0.797/0.901 | 0.886/0.946 | 0.681/0.775 | 0.910/0.940 |
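
When transcribing a table like this one, recomputing a column mean from the per-task cells is a quick sanity check on the Average row. The sketch below does this for the DiT1dLnet column. The helper name `column_means` is hypothetical, and because this excerpt does not define the two numbers in each cell, they are referred to simply as the first and second entry.

```python
# Per-task cells for the DiT1dLnet column, copied from the table above
# as (first entry, second entry).
dit1dlnet = {
    "pusht": (1.000, 1.000),
    "relay-kitchen": (1.000, 1.000),
    "lift-ph": (1.000, 1.000),
    "lift-mh": (1.000, 1.000),
    "can-ph": (0.982, 1.000),
    "can-mh": (0.990, 1.000),
    "square-ph": (0.945, 0.983),
    "square-mh": (0.835, 0.903),
    "transport-ph": (0.443, 0.611),
    "toolhang-ph": (0.806, 0.902),
}


def column_means(cells):
    """Return the unweighted means of the first and second entries over all tasks."""
    n = len(cells)
    first = sum(a for a, _ in cells.values()) / n
    second = sum(b for _, b in cells.values()) / n
    return first, second


first_mean, second_mean = column_means(dit1dlnet)
print(f"{first_mean:.3f}/{second_mean:.3f}")
```

If a recomputed mean differs from the printed Average row, the per-task cells and the summary row should be rechecked against the published table.
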
| Method | ||||
|---|---|---|---|---|
| BC-RNN | 1.000 | 0.903 | 0.741 | 0.343 |
| IBC | 0.990 | 0.871 | 0.619 | 0.242 |
| BET | 0.990 | 0.933 | 0.715 | 0.447 |
| DiffusionPolicy | 1.000 | 1.000 | 1.000 | 0.990 |
| DiffusionBC | 1.000 | 0.983 | 0.761 | 0.552 |
| DiT1dLnet (ours) | 1.000 | 1.000 | 1.000 | 1.000 |
| Variant | DiT Block | CNN1d Decoder | FiLM | Residual Connections | Pusht | Kitchen | Lift-ph | Can-ph | Square-ph | Transport-ph | Toolhang-ph |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Full model | ✓ | ✓ | ✓ | 3 | 1.000 | 1.000 | 1.000 | 1.000 | 0.983 | 0.783 | 0.902 |
| Only DiT | ✓ | × | ✓ | × | 1.000 | 1.000 | 1.000 | 1.000 | 0.968 | 0.643 | 0.583 |
| Only CNN1d | × | ✓ | ✓ | × | 0.990 | 0.994 | 1.000 | 1.000 | 0.764 | 0.342 | 0.361 |
| with FiLM shift | ✓ | ✓ | ✓ | 3 | 1.000 | 0.998 | 1.000 | 1.000 | 0.958 | 0.744 | 0.873 |
| Single Residual | ✓ | ✓ | ✓ | 1 | 1.000 | 0.963 | 1.000 | 1.000 | 0.832 | 0.391 | 0.422 |
| No residual | ✓ | ✓ | ✓ | × | 1.000 | 0.712 | 1.000 | 1.000 | 0.827 | 0.327 | 0.371 |
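
The FiLM and residual-connection columns refer to two generic building blocks that can be sketched in a few lines. The code below is a minimal illustration in plain Python, not the authors' implementation. The function names `film` and `residual` are hypothetical, and the scale and shift values are fixed numbers here, whereas in a conditioned network they would normally be predicted from a conditioning embedding. Passing `shift=None` gives a scale-only modulation, and supplying a shift adds the per-channel offset.

```python
from typing import Callable, List, Optional, Sequence


def film(
    features: Sequence[float],
    scale: Sequence[float],
    shift: Optional[Sequence[float]] = None,
) -> List[float]:
    """Feature-wise linear modulation: scale[i] * x[i], plus shift[i] if given."""
    if len(features) != len(scale):
        raise ValueError("scale must have one value per feature")
    out = [g * x for g, x in zip(scale, features)]
    if shift is not None:
        if len(shift) != len(features):
            raise ValueError("shift must have one value per feature")
        out = [y + b for y, b in zip(out, shift)]
    return out


def residual(block: Callable[[Sequence[float]], List[float]], x: Sequence[float]) -> List[float]:
    """Skip connection: add the block's output back onto its input."""
    return [xi + yi for xi, yi in zip(x, block(x))]


x = [1.0, -2.0, 0.5]
gamma = [2.0, 0.5, 1.0]   # illustrative scale values
beta = [0.1, 0.0, -0.5]   # illustrative shift values

scale_only = film(x, gamma)
scale_and_shift = film(x, gamma, beta)
with_skip = residual(lambda v: film(v, gamma), x)
print(scale_only, scale_and_shift, with_skip)
```
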
| Algo | Model Size | Inference Time |
|---|---|---|
| DiffusionPolicy-ChiUnet1d | 68.913 | 0.405 |
| DiffusionBC-PearceMLP | 0.834 | 0.062 |
| DiT1dLnet (ours) | 13.482 | 0.203 |
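
The relative cost of two rows in this table follows from simple ratios. The sketch below computes them for DiffusionPolicy-ChiUnet1d against DiT1dLnet. The helper name `ratios` is hypothetical, and since the table does not state units, the results are unitless factors.

```python
# (model size, inference time) per row, copied from the table above.
rows = {
    "DiffusionPolicy-ChiUnet1d": (68.913, 0.405),
    "DiffusionBC-PearceMLP": (0.834, 0.062),
    "DiT1dLnet (ours)": (13.482, 0.203),
}


def ratios(baseline: str, candidate: str):
    """Return (size ratio, time ratio) of baseline over candidate."""
    size_b, time_b = rows[baseline]
    size_c, time_c = rows[candidate]
    return size_b / size_c, time_b / time_c


size_ratio, time_ratio = ratios("DiffusionPolicy-ChiUnet1d", "DiT1dLnet (ours)")
print(f"size x{size_ratio:.2f}, time x{time_ratio:.2f}")
```

A ratio above 1 means the baseline is larger or slower than the candidate, and swapping the arguments gives the comparison in the other direction.
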
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Liao, J.; He, W.; Yu, Q.; Chen, F. DiT1dLnet: A Fast and Accurate Diffusion Model Structure Based on Robot Behavior Imitation. Mathematics 2026, 14, 1785. https://doi.org/10.3390/math14111785

