Scalable Diffusion Policy: Scale Up Diffusion Policy via Transformers for Visuomotor Learning

1Midea Group, 2East China Normal University, 3Stanford University, 4Shanghai University
*Equal contribution

Abstract

Diffusion Policy is a potent tool for learning visuomotor robot control. Like other deep neural networks, it is expected to be scalable, meaning that increasing the model size should yield improved performance. However, our observations indicate that Diffusion Policy with a transformer architecture (DP-T) struggles to scale effectively; even adding a few layers can degrade training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, ScaleDP, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP-T suffers from large gradients, which make the optimization of Diffusion Policy unstable. To resolve this, we factorize the feature embedding of the observation into multiple affine layers and integrate them into the transformer blocks. Second, our unmasking strategy allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with performance and generalization improving as model size grows. We benchmark ScaleDP across 50 tasks from MetaWorld and find that our largest ScaleDP outperforms DP-T by an average of 21.6%. On four real-robot tasks, ScaleDP demonstrates an average improvement of 22.5% over DP-T. Through multiple simulation and real-world experiments, we validate the superior performance of ScaleDP compared to the conventional diffusion transformer policy. We believe our work paves the way for scaling up models for visuomotor learning.

Model Architecture

To improve the training stability of DP-T and the consistency of the action embeddings, our proposed ScaleDP model makes two key modifications to the original DP-T architecture. (1) Instead of fusing conditional information via cross-attention as in DP-T, ScaleDP injects it through adaptive normalization layers, which stabilizes the training process. (2) ScaleDP uses an unmasking strategy that allows the policy network to "see" future actions during prediction, helping to reduce compounding errors.
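To make these two modifications concrete, below is a minimal PyTorch sketch of a transformer block that conditions on the observation embedding through adaptive layer normalization and applies unmasked self-attention over the action tokens. Names such as AdaLNBlock, cond_dim, and ada_affine are illustrative assumptions, not taken from the official implementation; the actual ScaleDP code may differ in details.

import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Illustrative transformer block conditioned via adaptive layer norm.

    Instead of fusing the observation embedding through cross-attention
    (as in DP-T), an affine layer maps the conditioning vector to per-block
    scale/shift/gate parameters that modulate the normalized activations.
    """
    def __init__(self, dim: int, num_heads: int, cond_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # One affine layer produces 6 modulation vectors:
        # (shift, scale, gate) for the attention branch and for the MLP branch.
        self.ada_affine = nn.Linear(cond_dim, 6 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, T, dim)  noised action tokens
        # cond: (B, cond_dim) observation + diffusion-timestep embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = \
            self.ada_affine(cond).chunk(6, dim=-1)
        # Unmasked self-attention: no causal mask is applied, so every action
        # token attends to all others and the policy "sees" future actions
        # within the predicted chunk.
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + gate1.unsqueeze(1) * attn_out
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

Under these assumptions, the conditioning enters only through the affine modulation parameters, so the attention and MLP weights themselves are shared across all conditions, which is the property typically credited with more stable optimization in adaptive-normalization transformers.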


Experiments

We investigate whether ScaleDP achieves better performance across a wide spectrum of robot manipulation tasks and environments. To this end, we evaluate ScaleDP on 7 real-world robot manipulation tasks, including 4 single-arm tasks and 3 bimanual tasks.

Quantitative Comparison

Success rates (%) on four single-arm tasks (Close Laptop, Flip Mug, Stack Cube, Place Tennis) and three bimanual tasks (Put Tennis into Bag, Sweep Trash, Bimanual Stack Cube). Note that as the model size increases, the average success rate increases correspondingly, demonstrating the scalability of our architecture. Each single-arm task is evaluated over 20 trials, and each bimanual task over 10 trials.
Model      | Close Laptop | Flip Mug | Stack Cube | Place Tennis | Put Tennis into Bag | Sweep Trash | Bimanual Stack Cube | Average
DP-T       | 80           | 70       | 50         | 5            | 20                  | 50          | 0                   | 39.28 ± 29.08
ScaleDP-S  | 85           | 70       | 50         | 30           | 100                 | 50          | 10                  | 56.42 ± 28.87
ScaleDP-B  | 80           | 65       | 50         | 55           | 100                 | 60          | 10                  | 60.00 ± 25.77
ScaleDP-L  | 95           | 80       | 70         | 50           | 100                 | 80          | 90                  | 80.71 ± 15.68
ScaleDP-H  | 95           | 95       | 90         | 70           | 100                 | 95          | 100                 | 92.14 ± 9.58

Qualitative Comparison

Qualitatively, we observe that ScaleDP executes the tasks more smoothly and precisely than DP-T.

ScaleDP (Ours)

DP-T

BibTeX

@article{zhu2024scaledp,
  title   = {Scalable Diffusion Transformer Policy for Visuomotor Learning},
  author  = {Minjie Zhu},
  journal = {arXiv preprint},
  year    = {2024},
}