RGBSQ-Grasp: Inferring Local Superquadric Primitives from Single RGB image for Graspability-Aware Bin Picking

Abstract

Superquadrics (SQs) offer a compact, interpretable shape representation that captures the physical structure and graspability of objects. In this work, we propose RGBSQ-Grasp, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera, eliminating the need for depth sensors. Our framework integrates a cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. Together, these components let RGBSQ-Grasp reliably infer grasp poses through geometric reasoning. Real-world robotic experiments demonstrate a 92% grasp success rate, highlighting the effectiveness of RGBSQ-Grasp in packed bin-picking environments.

Superquadric Representation

Superquadrics (SQs) are a compact family of parametric primitives that can model cylinders, spheres, ellipsoids, rounded boxes, and more by varying a small set of shape exponents and scales. An SQ can be defined by the following implicit function: the surface is the set of points where the function equals zero, points with negative values lie inside, and points with positive values lie outside:

\begin{equation} F(\mathbf{x}) \;=\; \left( \left|\frac{x}{a_x}\right|^{\frac{2}{\epsilon_2}} +\left|\frac{y}{a_y}\right|^{\frac{2}{\epsilon_2}} \right)^{\frac{\epsilon_2}{\epsilon_1}} \;+\; \left|\frac{z}{a_z}\right|^{\frac{2}{\epsilon_1}} \;-\;1. \label{eq:sq-implicit} \end{equation}

Here $\mathbf{x}=[x,y,z]^{\top}\!\in\mathbb{R}^3$ is expressed in the SQ’s local coordinates; $a_x,a_y,a_z>0$ are axis scales; and $\epsilon_1,\epsilon_2> 0$ are shape exponents controlling curvature and “boxiness.” Because they enter the equation as $2/\epsilon$, values near 0 sharpen faces and edges toward a box, a value of 1 gives an ellipsoid, and values approaching 2 give pinched, diamond-like profiles. $\epsilon_1$ shapes the profile along $z$, and $\epsilon_2$ shapes the cross-section in the $xy$-plane. The image below visualizes how varying $(\epsilon_1,\epsilon_2)$ changes the shape.

**Figure.** Grid of superquadrics… Rows vary $ \epsilon_1 $; columns vary $ \epsilon_2 $. Exponents near 0 produce squarer silhouettes; exponents near 1 produce rounder ones.
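As a minimal sketch of how the implicit function above can be evaluated, the NumPy snippet below computes $F(\mathbf{x})$ for a batch of points given in the SQ's local frame. The function name `sq_implicit` and its argument layout are illustrative assumptions, not part of the RGBSQ-Grasp code; the exponents are assumed strictly positive so that $2/\epsilon$ is defined.

```python
import numpy as np

def sq_implicit(points, scales, eps1, eps2):
    """Evaluate the superquadric implicit function F(x) in the local frame.

    points: (N, 3) array-like; scales: (a_x, a_y, a_z); eps1, eps2 > 0.
    Returns an (N,) array: < 0 inside, 0 on the surface, > 0 outside.
    (Hypothetical helper for illustration only.)
    """
    pts = np.atleast_2d(np.asarray(points, dtype=float))
    ax, ay, az = scales
    x = np.abs(pts[:, 0] / ax)
    y = np.abs(pts[:, 1] / ay)
    z = np.abs(pts[:, 2] / az)
    xy_term = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + z ** (2.0 / eps1) - 1.0

# Unit sphere (eps1 = eps2 = 1): origin is inside, (1,0,0) is on the
# surface, (2,0,0) is outside.
sphere_vals = sq_implicit([[0, 0, 0], [1, 0, 0], [2, 0, 0]], (1, 1, 1), 1.0, 1.0)

# The same near-corner point is outside the sphere but inside a boxy SQ
# whose exponents are close to 0.
corner = [[0.9, 0.9, 0.9]]
corner_in_sphere = sq_implicit(corner, (1, 1, 1), 1.0, 1.0)[0]
corner_in_box = sq_implicit(corner, (1, 1, 1), 0.1, 0.1)[0]
```

The corner test illustrates the role of the exponents: shrinking $\epsilon_1,\epsilon_2$ toward 0 pushes the surface out toward the corners of the bounding box.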

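For surface sampling and rendering, it is convenient to use the following explicit (parametric) form, in which each power is read as a signed power, $c^{\epsilon}\equiv\operatorname{sgn}(c)\,|c|^{\epsilon}$, so that negative sines and cosines remain well defined: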

\begin{equation} \mathbf{r}(\eta,\omega) \;=\; \begin{bmatrix} a_x\,\cos^{\epsilon_1}\!\eta\,\cos^{\epsilon_2}\!\omega \\ a_y\,\cos^{\epsilon_1}\!\eta\,\sin^{\epsilon_2}\!\omega \\ a_z\,\sin^{\epsilon_1}\!\eta \end{bmatrix}, \qquad \eta\in\left[-\tfrac{\pi}{2},\,\tfrac{\pi}{2}\right],\; \omega\in[-\pi,\,\pi]. \label{eq:sq-explicit} \end{equation}
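The explicit form can be turned into a simple surface sampler. The sketch below assumes a regular grid over $(\eta,\omega)$ and evaluates each power as a signed power, $\operatorname{sgn}(c)\,|c|^{\epsilon}$, since a negative base with a fractional exponent is otherwise undefined. The names `signed_pow` and `sample_sq_surface` and the grid resolution are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def signed_pow(base, exponent):
    """Signed power sgn(b) * |b|**e, valid for negative bases."""
    return np.sign(base) * np.abs(base) ** exponent

def sample_sq_surface(scales, eps1, eps2, n_eta=32, n_omega=64):
    """Sample surface points of a superquadric in its local frame.

    Returns an (n_eta * n_omega, 3) array. The grid is uniform in the
    angles (eta, omega), which is NOT uniform in surface area.
    (Hypothetical helper for illustration only.)
    """
    ax, ay, az = scales
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)
    omega = np.linspace(-np.pi, np.pi, n_omega, endpoint=False)
    eta, omega = np.meshgrid(eta, omega, indexing="ij")
    x = ax * signed_pow(np.cos(eta), eps1) * signed_pow(np.cos(omega), eps2)
    y = ay * signed_pow(np.cos(eta), eps1) * signed_pow(np.sin(omega), eps2)
    z = az * signed_pow(np.sin(eta), eps1)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a rounded, elongated primitive with illustrative dimensions.
scales, eps1, eps2 = (0.03, 0.02, 0.05), 0.5, 1.5
surface_pts = sample_sq_surface(scales, eps1, eps2)

# Sanity check: every sample should satisfy the implicit equation F = 0.
p = np.abs(surface_pts / np.asarray(scales))
residual = (p[:, 0] ** (2 / eps2) + p[:, 1] ** (2 / eps2)) ** (eps2 / eps1) \
    + p[:, 2] ** (2 / eps1) - 1.0
```

Because the grid is uniform in angle rather than in arc length, samples cluster near edges when the exponents are small; a fitting or rendering pipeline that needs even coverage would have to resample.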

A full superquadric instance is determined by 11 parameters: three scales $(a_x,a_y,a_z)$, two shape exponents $(\epsilon_1,\epsilon_2)$, and a 6-DoF pose in 3D space. The pose is written as $\mathbf{g}=[\mathbf{R},\mathbf{t}] \in SE(3)$ with $\mathbf{R}\in SO(3)$ the rotation and $\mathbf{t}\in\mathbb{R}^3$ the translation. Equations \eqref{eq:sq-implicit}–\eqref{eq:sq-explicit} are the forms we use for inference, fitting, and point-cloud sampling.
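To make the 11-parameter description concrete, the following sketch bundles the scales, exponents, and pose into one container and evaluates the implicit function for points given in world coordinates. It assumes that the pose $[\mathbf{R},\mathbf{t}]$ maps local coordinates to world coordinates, so that $\mathbf{x}_{\text{local}}=\mathbf{R}^{\top}(\mathbf{x}_{\text{world}}-\mathbf{t})$; the text does not state this convention, and the class name `Superquadric` and the numeric values are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Superquadric:
    """Hypothetical container for the 11 SQ parameters (3 + 2 + 6 DoF)."""
    scales: np.ndarray  # (a_x, a_y, a_z), all > 0
    eps1: float         # shape exponent, > 0
    eps2: float         # shape exponent, > 0
    R: np.ndarray       # (3, 3) rotation, ASSUMED local -> world
    t: np.ndarray       # (3,) translation, ASSUMED local -> world

    def to_local(self, pts_world):
        # Row-vector form of x_local = R^T (x_world - t).
        return (np.asarray(pts_world, dtype=float) - self.t) @ self.R

    def implicit(self, pts_world):
        # Implicit function evaluated after moving points to the local frame.
        p = np.abs(self.to_local(pts_world) / self.scales)
        xy = (p[:, 0] ** (2 / self.eps2)
              + p[:, 1] ** (2 / self.eps2)) ** (self.eps2 / self.eps1)
        return xy + p[:, 2] ** (2 / self.eps1) - 1.0

# Ellipsoid whose long (local z) axis is rotated onto the world x axis.
R_y90 = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
sq = Superquadric(scales=np.array([0.02, 0.02, 0.06]), eps1=1.0, eps2=1.0,
                  R=R_y90, t=np.array([0.5, 0.0, 0.1]))

world_pts = [[0.5, 0.0, 0.1],    # centre: inside
             [0.56, 0.0, 0.1],   # tip of the long axis: on the surface
             [0.5, 0.0, 0.16]]   # 6 cm along world z: outside
pose_vals = sq.implicit(world_pts)
```

Storing the rotation as a matrix keeps the example short; a fitting network would more likely predict a minimal rotation parameterization and convert it, but that choice is outside what the text specifies.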

RGBSQ-Grasp Framework

Our method addresses the challenge of enabling robots to understand cluttered scenes and identify graspable regions from a single monocular RGB image in bin-picking tasks. As shown in Figure 1, our framework operates in four phases: Dataset Generation creates a synthetic dataset that simulates occlusions; the Superquadrics Fitting Network is trained on this synthetic data to extract superquadric primitives from partial point clouds; during inference, SQ-guided Scene Understanding estimates the SQ-represented scene using vision foundation models; and Grasp Pose Sampling generates grasp poses.

Selected Demonstration Video

Below is the selected demonstration video showcasing the effectiveness of RGBSQ-Grasp in cluttered bin-picking environments.
