TARGO: Benchmarking Target-driven Object Grasping under Occlusions

1 Technical University of Munich, 2 CFCS, School of CS, Peking University,
3 University of Oxford, 4 Munich Center for Machine Learning (MCML)
*Indicates Equal Contribution

Abstract

Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models struggle in cluttered environments, where nearby objects interfere with grasping the target object. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contributions:

1) We are the first to study the occlusion level in target-driven grasping.

2) We set up an evaluation benchmark consisting of large-scale synthetic data and some real-world data. We evaluated five grasp models and found that even the current SOTA model degrades as the occlusion level increases, so grasping under occlusion remains a challenge.

3) We also generate a large-scale training dataset via a scalable pipeline, which can be used to improve grasping performance under occlusion and to generalize to the real world.

4) We further propose a transformer-based grasping model involving a shape completion module, termed TARGO-Net, which performs most robustly as occlusion increases.

TARGO Dataset

Data Overview

We present a novel dataset, TARGO, to thoroughly investigate the occlusion challenge in robotic grasping. The synthetic dataset can be used to train and evaluate target-driven grasping models under various occlusion levels, and the real-world dataset can be used to test zero-shot transfer and evaluate grasping performance in practice.

The TARGO-Synthetic training and test datasets are generated using the physics simulator PyBullet, which samples collision-free grasps in a $30 \times 30 \times 30 \mathrm{~cm}^3$ tabletop workspace. We generated the packed scenes using 114 training and 16 test objects provided by VGN. In each cluttered scene, one object is selected as the target. This induces a corresponding single scene, in which only the target remains on the table and all occluders are removed. Given the single scene, the occlusion level (the percentage of the target object occluded by the occluders) can be calculated. Sampling grasps from single scenes helps grasp prediction, and the occlusion level can be used to evaluate a model's robustness to occlusion. The TARGO-Synthetic training dataset is available for download.
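
The occlusion level described above can be illustrated with a minimal sketch. The text only states that it is the percentage of the target occluded by the occluders; the pixel-count ratio used here (target pixels visible in the cluttered scene versus the single scene, from the same viewpoint) is one plausible reading and an assumption, as are the function and parameter names.

```python
import numpy as np


def occlusion_level(target_mask_single: np.ndarray,
                    target_mask_cluttered: np.ndarray) -> float:
    """Fraction of the target hidden by occluders (assumed definition).

    Both arguments are boolean masks of the target rendered from the
    same camera pose: one in the single scene (target alone) and one
    in the cluttered scene (target plus occluders).
    """
    visible_alone = int(np.count_nonzero(target_mask_single))
    if visible_alone == 0:
        raise ValueError("target is not visible in the single scene")
    visible_cluttered = int(np.count_nonzero(target_mask_cluttered))
    # Pixels lost to occluders, relative to the unoccluded view.
    return 1.0 - visible_cluttered / visible_alone


# Toy example: the target covers 16 pixels alone, 8 when occluded.
single = np.zeros((8, 8), dtype=bool)
single[2:6, 2:6] = True
cluttered = np.zeros((8, 8), dtype=bool)
cluttered[2:6, 2:4] = True
level = occlusion_level(single, cluttered)
```

In this toy case half of the target's pixels are hidden, so `level` is 0.5. Multiplying by 100 gives the percentage form used in the text.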

The TARGO-Real dataset includes object models and object poses in scenes provided by DexYCB. Using the real-world observations, we created the corresponding scenes in PyBullet to evaluate various baseline methods.

Cluttered Scene vs. Single Scene

[Figure: Cluttered Scene | Single Scene]

TARGO-Synthetic Dataset Folder Contents

The compressed archive file contains the following:

  • mesh_pose_dict/: Each file is named by a unique scene ID, a scene type (c for cluttered, s for single), and a target ID. For instance, 0006ab82ea3f4b8aad9445b27f45487b_s_1.npz denotes the single scene with ID 0006ab82ea3f4b8aad9445b27f45487b whose target is the first object in the scene. Each file contains the file paths of the object meshes and the poses of the objects in the scene.
  • scenes/: Same naming convention as mesh_pose_dict/. Each file contains the depth images, target masks, and TSDF grids of the scene, along with the target points back-projected from depth and the occlusion levels.
  • grasps.csv: This file lists the scene IDs of cluttered and single scenes along with their associated 6DoF grasp poses, grasp widths, and grasp labels (1 for success, 0 for failure).
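
The file naming convention above can be unpacked with a small helper. This is a sketch based only on the example file name given in the list; `SceneFile` and `parse_scene_filename` are hypothetical names and not part of the released dataset tooling.

```python
from pathlib import Path
from typing import NamedTuple


class SceneFile(NamedTuple):
    scene_id: str
    scene_type: str  # "cluttered" or "single"
    target_id: int


def parse_scene_filename(name: str) -> SceneFile:
    """Split '<scene_id>_<c|s>_<target_id>.npz' into its three parts."""
    stem = Path(name).stem
    # Split from the right so the scene ID is kept whole.
    scene_id, kind, target = stem.rsplit("_", 2)
    if kind not in ("c", "s"):
        raise ValueError(f"unexpected scene type {kind!r} in {name!r}")
    scene_type = {"c": "cluttered", "s": "single"}[kind]
    return SceneFile(scene_id, scene_type, int(target))


info = parse_scene_filename("0006ab82ea3f4b8aad9445b27f45487b_s_1.npz")
```

The array names stored inside each .npz file are not documented here. One way to inspect them without guessing is `numpy.load(path).files`, which lists the keys of an archive.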

TARGO-Net Real-world Experiments at 2.5x Speed

TARGO-Net vs. GIGA

TARGO-Net Successful Grasps

TARGO-Net Failed Grasps