Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning

1Shanghai Artificial Intelligence Laboratory, 2University of Science and Technology of China, 3Northwestern Polytechnical University, 4Zhejiang University


OBSBench

We develop OBSBench, a benchmark with standardized pipelines covering various encoders, pre-trained visual representations (PVRs), policies, simulators, evaluation settings, and more. We then examine the impact of different observation spaces, specifically RGB, RGB-D, and point clouds, on robot learning.
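As a rough illustration of how such a standardized pipeline can be composed (the class and argument names below are hypothetical, not the actual OBSBench API), the observation modality determines only the encoder, while the policy head and training loop stay fixed:

import torch
import torch.nn as nn

class BCPipeline(nn.Module):
    """Sketch: modality-specific encoder + shared behavior-cloning policy head."""

    def __init__(self, encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder                   # RGB / RGB-D / point cloud encoder
        self.policy = nn.Sequential(             # modality-agnostic policy head
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(obs)                 # (B, feat_dim) observation feature
        return self.policy(feat)                 # predicted action

Swapping, e.g., a ResNet-based RGB encoder for a PointNet-style point cloud encoder then amounts to changing only the encoder argument, keeping the policy and training recipe identical.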

Abstract

In robot learning, the observation space is crucial due to the distinct characteristics of different modalities, which can potentially become a bottleneck alongside policy design. In this study, we explore the influence of various observation spaces on robot learning, focusing on three predominant modalities: RGB, RGB-D, and point cloud. We introduce OBSBench, a benchmark comprising two simulators and 125 tasks, along with standardized pipelines for various encoders and policy baselines. Extensive experiments on diverse contact-rich manipulation tasks reveal a notable trend: point cloud-based methods, even those with the simplest designs, frequently outperform their RGB and RGB-D counterparts. This trend persists in both scenarios: training from scratch and utilizing pre-training. Furthermore, our findings demonstrate that point cloud observations often yield better policy performance and significantly stronger generalization capabilities across various geometric and visual conditions. These outcomes suggest that the 3D point cloud is a valuable observation modality for intricate robotic tasks. We also suggest that incorporating both appearance and coordinate information can enhance the performance of point cloud methods. We hope our work provides valuable insights and guidance for designing more generalizable and robust robotic models.

Experiments

Based on OBSBench, we conduct extensive experiments on different observation modalities across 19 diverse representative tasks. Our experiments aim to address the following research questions:

Q1: How do varying observation spaces influence robot learning performance?
Q2: What is the performance impact of pre-trained visual representations (PVRs)?
Q3: How do zero-shot generalization capabilities differ across observation spaces?
Q4: How does sample efficiency compare across observation spaces?
Q5: How do different design decisions influence point cloud performance?

Performance of different observations with and without pre-training (Q1, Q2)


Finding 1: We observe that using a point cloud encoder results in the highest mean success rate and mean rank. Point cloud methods consistently outperform other modalities, whether employing ACT policy or diffusion policy.

Finding 2: The depth modality, despite also providing 3D information, generally degrades performance across all settings. Although the RGB-D version of ResNet has a slightly better mean success rate than the RGB version, it performs significantly worse on 7 tasks and has a lower mean rank, indicating instability.
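For reference, the RGB-D variant of a 2D backbone is commonly obtained by appending depth as a fourth input channel and widening the first convolution accordingly; the following is a minimal sketch of that construction, not necessarily the exact recipe used in our pipeline:

import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18(weights=None)
# Replace the 3-channel stem with a 4-channel one (RGB + depth).
resnet.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)

rgb = torch.rand(1, 3, 224, 224)    # normalized color image
depth = torch.rand(1, 1, 224, 224)  # depth map, rescaled to [0, 1]
out = resnet(torch.cat([rgb, depth], dim=1))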

Finding 3: Using PVRs can lead to better performance on average, though not for all individual tasks.



Zero-shot generalization across observation spaces (Q3)


Finding 4: All methods are significantly affected by camera view changes, even a shift of only 5 degrees. However, point cloud methods, whether pre-trained or trained from scratch, show notable resilience. This suggests that image-based models are overly dependent on the specific training views.

Finding 5: We observe that point cloud methods generally exhibit better generalization than other observation spaces. Specifically, we find that SpUNet demonstrates significantly greater robustness to foreground visual changes, whereas PointNet's performance drops dramatically to near zero in these scenarios. Conversely, PointNet shows superior generalization to changes in camera view.

Finding 6: Utilizing PVRs generally improves model generalization, especially when semantic information is incorporated during pre-training, as seen with MultiMAE and PonderV2.



Sample efficiency (Q4)


Finding 7: Our analysis reveals that point cloud observation spaces do not demonstrate a significant advantage in sample efficiency compared to other modalities. Notably, our results indicate that PVRs consistently improve performance in scenarios with limited training data.



Design decisions on point cloud observation space (Q5)


Finding 8: Post-sampling, i.e., applying farthest point sampling (FPS) to the feature map after the encoder, can significantly enhance the performance of point cloud-based methods because it preserves more local information.
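A minimal sketch of post-sampling is given below, assuming an encoder that outputs one feature per input point; the helper names and sizes are illustrative. The key point is that FPS selects which per-point features to keep after encoding, so every retained feature was computed with access to the full cloud:

import torch

def farthest_point_sampling(xyz: torch.Tensor, n_keep: int) -> torch.Tensor:
    """xyz: (N, 3) point coordinates -> indices of n_keep well-spread points."""
    N = xyz.shape[0]
    idx = torch.zeros(n_keep, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    idx[0] = torch.randint(0, N, (1,)).item()        # random seed point
    for i in range(1, n_keep):
        dist = torch.minimum(dist, torch.norm(xyz - xyz[idx[i - 1]], dim=1))
        idx[i] = torch.argmax(dist)                  # farthest remaining point
    return idx

xyz = torch.rand(4096, 3)                            # raw point cloud
per_point_feat = torch.rand(4096, 128)               # stand-in for encoder output
keep = farthest_point_sampling(xyz, 1024)            # sample *after* encoding
scene_feat = per_point_feat[keep].max(dim=0).values  # pooled (128,) scene feature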

Finding 9: Coordinate information is more critical than color information, as removing coordinate features results in a larger performance drop.

Finding 10: Pointmap, which seemingly integrates the advantages of both 2D images and 3D information, consistently outperforms RGB-only and RGB-D methods. However, it still lags behind point clouds, especially when using diffusion policies.
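For clarity, a pointmap stores the back-projected 3D coordinate of every pixel in an image grid, so it can be fed to ordinary 2D backbones. A minimal construction from a depth image and camera intrinsics is sketched below; the intrinsic values are illustrative, not taken from our setup:

import torch

def depth_to_pointmap(depth: torch.Tensor, fx: float, fy: float,
                      cx: float, cy: float) -> torch.Tensor:
    """depth: (H, W) in meters -> (3, H, W) per-pixel XYZ in the camera frame."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return torch.stack([x, y, depth], dim=0)

depth = torch.rand(224, 224) * 2.0                        # dummy depth map
pointmap = depth_to_pointmap(depth, fx=300.0, fy=300.0, cx=112.0, cy=112.0)
rgb = torch.rand(3, 224, 224)
rgb_pointmap = torch.cat([rgb, pointmap], dim=0)          # 6-channel image-like input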

Conclusion

In this study, we introduce OBSBench to advance research on various observation spaces in robot learning. Our findings indicate that point cloud methods consistently outperform RGB and RGB-D in terms of success rate and robustness across different conditions, regardless of whether they are trained from scratch or pre-trained. Design choices, such as post-sampling and the inclusion of coordinate and color information, further enhance their performance. However, point cloud methods face challenges with sample efficiency. Utilizing large-scale 3D datasets like RH20T and DL3DV-10K could improve their robustness and generalization. Future research should explore dynamic sampling techniques and multi-modal integration, including tactile sensing. Although our experiments are conducted on simulated benchmarks to ensure consistency and fairness, translating and validating these findings in real-world scenarios in a reproducible and credible manner remains an open question. In the short term, we do not foresee any negative societal impacts from this work. However, as our results contribute to the development of more robust robotic systems, it is crucial to study how to prevent robots from causing harm in daily life in the long run.

BibTeX

@article{zhu2024point,
  title = {Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning},
  author = {Zhu, Haoyi and Wang, Yating and Huang, Di and Ye, Weicai and Ouyang, Wanli and He, Tong},
  journal = {arXiv preprint arXiv:2402.02500},
  year = {2024},
}