To improve training stability and the consistency of each action embedding, ScaleDP makes two key modifications to the original DP-T architecture. (1) Whereas DP-T fuses conditional information via cross-attention, ScaleDP injects it via adaptive normalization layers, which stabilizes the training process. (2) ScaleDP uses an unmasked attention strategy that allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. A sketch of such a block is shown below.
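To make these two modifications concrete, here is a minimal PyTorch sketch of one Transformer block in this style, following the DiT-style adaLN-Zero recipe: a condition vector (e.g., a fused observation and diffusion-timestep embedding) regresses per-block shift, scale, and gate parameters instead of being injected through cross-attention, and the self-attention carries no causal mask so action tokens can attend to future actions in the predicted chunk. The module name `AdaLNBlock` and the exact conditioning layout are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Transformer block with adaptive layer-norm (adaLN-Zero) conditioning
    and unmasked self-attention. Illustrative sketch, not the paper's code."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Condition -> (shift, scale, gate) for the attention and MLP paths.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # adaLN-Zero: start with both residual branches closed.
        nn.init.zeros_(self.adaLN[1].weight)
        nn.init.zeros_(self.adaLN[1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) noisy action tokens; c: (B, dim) condition embedding.
        (shift_a, scale_a, gate_a,
         shift_m, scale_m, gate_m) = self.adaLN(c).chunk(6, dim=-1)
        # Unmasked (bidirectional) self-attention: no causal attn_mask, so
        # every action token can attend to future actions in the chunk.
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + gate_a.unsqueeze(1) * h
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x
```

Zero-initializing the adaLN projection means every residual branch contributes nothing at initialization, the adaLN-Zero trick that is widely credited with stabilizing large diffusion Transformers, consistent with the stability motivation above.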
We investigate whether ScaleDP achieves better performance across a wide spectrum of robot manipulation tasks and environments. To this end, we evaluate ScaleDP on 7 real-world robot manipulation tasks: 4 single-arm tasks and 3 bimanual tasks.
Success rate (%) on the 7 real-world tasks. Close Laptop, Flip Mug, Stack Cube, and Place Tennis are single-arm tasks; Put Tennis into Bag, Sweep Trash, and Bimanual Stack Cube are bimanual tasks. The Average column reports mean ± standard deviation across all tasks.

| Model | Close Laptop | Flip Mug | Stack Cube | Place Tennis | Put Tennis into Bag | Sweep Trash | Bimanual Stack Cube | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DP-T | 80 | 70 | 50 | 5 | 20 | 50 | 0 | 39.28 ± 29.08 |
| ScaleDP-S | 85 | 70 | 50 | 30 | 100 | 50 | 10 | 56.42 ± 28.87 |
| ScaleDP-B | 80 | 65 | 50 | 55 | 100 | 60 | 10 | 60.00 ± 25.77 |
| ScaleDP-L | 95 | 80 | 70 | 50 | 100 | 80 | 90 | 80.71 ± 15.68 |
| ScaleDP-H | 95 | 95 | 90 | 70 | 100 | 95 | 100 | 92.14 ± 9.58 |
Qualitatively, we observe that ScaleDP executes actions more smoothly and precisely than DP-T.
Video comparison: ScaleDP (Ours) vs. DP-T.
@article{zhu2024scaledp,
  title   = {Scalable Diffusion Transformer Policy for Visuomotor Learning},
  author  = {Minjie Zhu},
  journal = {arXiv preprint},
  year    = {2024},
}