One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering

¹Yale University, ²Independent Researcher

Abstract

Identifying predictive world models for robots from sparse online observations is essential for robot task planning and execution in novel environments. However, existing methods that leverage differentiable programming to identify world models cannot jointly optimize the geometry, appearance, and physical properties of the scene. In this work, we introduce a novel rigid object representation that allows the joint identification of these properties. Our method employs a differentiable point-based geometry representation coupled with a grid-based appearance field, which enables differentiable object collision detection and rendering. Combined with a differentiable physical simulator, we achieve end-to-end optimization of world models of rigid objects, given sparse visual and tactile observations of a physical motion sequence. Through a series of world model identification tasks in simulated and real environments, we show that our method can learn both simulation- and rendering-ready rigid world models from only one robot action sequence.
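Concretely, the identification problem can be summarized (in simplified form; the exact loss terms and regularizers used in the paper may differ) as a single end-to-end objective over the object representation and physical parameters introduced in the methodology section:

$$\min_{P,\,\psi,\,M,\,\mu}\;\sum_{t}\;\mathcal{L}\Big(\mathcal{R}\big(\mathcal{S}_t(\mathrm{Mesh}(P),\,M,\,\mu;\;e_t,\,u_t),\,\psi\big),\;I_t\Big)$$

where Mesh(·) denotes the differentiable surface reconstruction from the oriented point cloud P, 𝒮ₜ the differentiable rigid-body simulation rolled out under the end-effector trajectory and controls ⟨e_t, u_t⟩, ℛ the differentiable renderer using the appearance grid ψ, and I_t the observed RGB-D frame at time t.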

Spotlight

Simulation Experiments

Ground-truth (GT) data generated by PyBullet. All videos are played at 1/10× speed.

Simulation 1: Bleach

Simulation 2: Mustard

Simulation 3: Sugar


Real-World Experiments

The end-effector is represented as a blue sphere. All videos are played at 1/3× speed.

Real Experiment 1: Drill

Real Experiment 2: Mustard

Real Experiment 3: Sugar

Methodology

Method Overview

Overview of the proposed fully differentiable pipeline for world model identification from sparse robot observations. Our object representation couples an oriented point cloud P with a 3D appearance grid ψ. A differentiable Poisson solver converts the oriented point cloud into an indicator grid χ, from which differentiable marching cubes extracts a mesh whose vertex colors are interpolated from the appearance grid ψ. The object mesh, the physical parameters M and μ, the terrain point cloud P_t, and the robot pushing trajectory and controls ⟨e_t, u_t⟩ are fed into a differentiable rigid-body simulator and renderer to produce predicted images of the scene. By computing a loss against the observed RGB-D images, the scene's shape, appearance, and physical parameters are jointly optimized with gradient descent.
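The minimal PyTorch sketch below illustrates how such an end-to-end loop can be wired up so that one loss drives gradients into shape, appearance, and physical parameters simultaneously. The stage functions (reconstruct_mesh, simulate_and_render) are simplified stand-ins invented here for illustration; they are not the differentiable Poisson solver, marching cubes, simulator, or renderer used in the paper.

```python
import torch

# Learnable scene parameters: oriented point cloud P (points + normals),
# appearance grid psi, mass M, and friction coefficient mu.
points  = torch.nn.Parameter(0.05 * torch.randn(1024, 3))
normals = torch.nn.Parameter(torch.randn(1024, 3))
psi     = torch.nn.Parameter(torch.rand(3, 32, 32, 32))   # RGB appearance grid
mass    = torch.nn.Parameter(torch.tensor(0.5))
mu      = torch.nn.Parameter(torch.tensor(0.3))

def reconstruct_mesh(points, normals):
    # Stand-in for the differentiable Poisson solve + marching cubes:
    # a real implementation would return mesh vertices and faces.
    return points + 0.0 * normals

def simulate_and_render(verts, psi, mass, mu, controls):
    # Stand-in for the differentiable rigid-body simulator and renderer:
    # toy "dynamics" displace the object by the accumulated controls,
    # attenuated by mass and friction, and a toy "renderer" summarizes
    # the scene as a 3-vector so a loss can be computed against it.
    moved = verts + controls.sum(dim=0) / (mass * (1.0 + mu))
    color = psi.mean(dim=(1, 2, 3))
    return moved.mean(dim=0) + color

controls = 0.01 * torch.randn(10, 3)   # robot pushing controls u_t (placeholder)
observed = torch.zeros(3)              # placeholder for the observed RGB-D data

opt = torch.optim.Adam([points, normals, psi, mass, mu], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    verts = reconstruct_mesh(points, normals)
    pred  = simulate_and_render(verts, psi, mass, mu, controls)
    loss  = torch.nn.functional.mse_loss(pred, observed)
    loss.backward()   # gradients flow end-to-end to shape, appearance, and physics
    opt.step()
```

In the actual system, each stand-in would be replaced by the corresponding differentiable module, while the outer gradient-descent loop stays essentially the same.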