ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Abstract

Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o’s contextual reasoning to plan grasping strategies. ThinkGrasp identifies target objects and generates grasp poses for them even when they are heavily obstructed or nearly invisible, using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real-world experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.

Publication
In CoRL 2024
Linfeng Zhao
CS Ph.D. Student

I am a CS Ph.D. student at Khoury College of Computer Sciences of Northeastern University, advised by Prof. Lawson L.S. Wong. My research interests include reinforcement learning, artificial intelligence, and robotics.
