Prompt Depth Anything

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin¹,² Sida Peng¹ Jingxiao Chen³ Songyou Peng⁴ Jiaming Sun¹ Minghuan Liu³
Hujun Bao¹ Jiashi Feng² Xiaowei Zhou¹ Bingyi Kang²

¹Zhejiang University ²ByteDance Seed ³Shanghai Jiao Tong University ⁴ETH Zurich

Prompt Depth Anything is a high-resolution and accurate metric depth estimation method, with the following highlights:

We use prompting to unleash the power of depth foundation models, inspired by the success of prompting in VLM and LLM foundation models.

The widely available iPhone LiDAR serves as the prompt, guiding the model to produce accurate metric depth at up to 4K resolution.

A scalable data pipeline is introduced to train our method, and we release a more detailed depth annotation for the ScanNet++ dataset.

Prompt Depth Anything benefits downstream applications, including 3D reconstruction and generalized robotic grasping.

Abstract

Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model toward accurate metric depth output at up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR depth at multiple scales within the depth decoder. To address the training challenges posed by the scarcity of datasets that contain both LiDAR depth and precise ground-truth (GT) depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo GT depth generation for real data. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
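To make the idea of LiDAR simulation on synthetic data more concrete, here is a minimal sketch. It is not the pipeline used in the paper, whose details are not given on this page. It assumes a naive stand-in: a dense synthetic depth map is block-averaged down to a low-resolution grid, and Gaussian noise is added to mimic a weak sensor. The function name `simulate_low_res_lidar` and the parameters `out_h`, `out_w`, `noise_std`, and `seed` are hypothetical.

```python
import random


def simulate_low_res_lidar(depth, out_h, out_w, noise_std=0.0, seed=0):
    """Naive low-res LiDAR stand-in: block-average a dense depth map, then add noise.

    depth is a list of rows (metres). This is an illustrative assumption,
    not the simulation procedure from the paper.
    """
    rng = random.Random(seed)
    in_h, in_w = len(depth), len(depth[0])
    if in_h % out_h or in_w % out_w:
        raise ValueError("input size must be a multiple of the output size")
    block_h, block_w = in_h // out_h, in_w // out_w

    low_res = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the dense depths that fall into this low-res cell.
            block = [
                depth[y][x]
                for y in range(i * block_h, (i + 1) * block_h)
                for x in range(j * block_w, (j + 1) * block_w)
            ]
            mean_depth = sum(block) / len(block)
            # Perturb with zero-mean Gaussian noise; clamp so depth stays non-negative.
            row.append(max(0.0, mean_depth + rng.gauss(0.0, noise_std)))
        low_res.append(row)
    return low_res
```

For example, `simulate_low_res_lidar(dense, 24, 24, noise_std=0.02)` would turn a 192×192 synthetic depth map into a noisy 24×24 grid, which could then stand in for the sensor prompt during training on synthetic data.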

Comparisons with ARKit LiDAR Depth

Dense LiDAR is accurate but expensive. Low-cost LiDAR is preferred, but its depth is low-resolution and noisy because of its limited power.

ARKit LiDAR depth is a low-resolution depth map generated by the ARKit API from the 24×24-point iPhone LiDAR and RGB images.

Application: Generalized Robotic Grasping

Even if the grasping policy is trained only on diffuse objects, our depth can help grasp transparent and specular ones, outperforming RGB and LiDAR.

This demo is implemented with our ViT-Small model, which runs at 94+ FPS on a single RTX 4090 GPU. The video is sped up 2x for demonstration.

Acknowledgements

We thank Prof. Weinan Zhang for his generous support of the robot experiments, including the space, objects, and the Unitree H1 robot. We also thank Zhengbang Zhu, Jiahang Cao, Xinyao Li, and Wentao Dong for their help in setting up the robot platform and collecting robot data.

Comparisons with Monocular Depth Methods

Monocular depth methods can generate high-resolution depth maps, but they struggle to produce a consistent metric scale, even after alignment with LiDAR.
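This page does not say how the monocular predictions are aligned with LiDAR. A common choice, assumed here purely for illustration, is a single global scale and shift fitted by least squares over pixels with a valid LiDAR reading. The sketch below uses hypothetical names (`fit_scale_shift`, `apply_scale_shift`) and flat lists of per-pixel values.

```python
def fit_scale_shift(relative, metric):
    """Fit scale s and shift t minimizing sum((s * r + t - m)^2).

    relative: monocular (relative) depth values.
    metric:   LiDAR depth values in metres; entries <= 0 are treated as invalid.
    Assumed global alignment; not necessarily the procedure used by the authors.
    """
    pairs = [(r, m) for r, m in zip(relative, metric) if m > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid LiDAR samples")

    sum_r = sum(r for r, _ in pairs)
    sum_m = sum(m for _, m in pairs)
    sum_rr = sum(r * r for r, _ in pairs)
    sum_rm = sum(r * m for r, m in pairs)

    # Closed-form solution of the 2x2 normal equations.
    det = n * sum_rr - sum_r * sum_r
    if abs(det) < 1e-12:
        raise ValueError("relative depth is constant; scale is undetermined")
    scale = (n * sum_rm - sum_r * sum_m) / det
    shift = (sum_m - scale * sum_r) / n
    return scale, shift


def apply_scale_shift(relative, scale, shift):
    """Map relative depth to metric depth with the fitted parameters."""
    return [scale * r + shift for r in relative]
```

Because one scale and one shift are shared by the whole image, this kind of alignment cannot correct errors that vary across the scene, which is one plausible reason the aligned monocular depth remains inconsistent in metric scale.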

More Testing Results

Application: Street Reconstruction

The prompt can be replaced with vehicle LiDAR, enabling high-precision depth estimation in street scenes.

Citation