Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation


Haotong Lin1,2     Sida Peng1     Jingxiao Chen3     Songyou Peng4     Jiaming Sun1     Minghuan Liu3    
Hujun Bao1     Jiashi Feng2     Xiaowei Zhou1     Bingyi Kang2
1Zhejiang University     2ByteDance Seed     3Shanghai Jiao Tong University     4ETH Zurich



Prompt Depth Anything is a high-resolution and accurate metric depth estimation method, with the following highlights:
  • We use prompting to unleash the power of depth foundation models, inspired by the success of prompting in VLM and LLM foundation models.
  • The widely available iPhone LiDAR is taken as the prompt, guiding the model to produce accurate metric depth at up to 4K resolution.
  • We introduce a scalable data pipeline to train our method, and we release more detailed depth annotations for the ScanNet++ dataset.
  • Prompt Depth Anything benefits downstream applications, including 3D reconstruction and generalized robotic grasping.


    Abstract


    Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation that we term Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model toward accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR depth at multiple scales within the depth decoder. To address the training challenge posed by the scarcity of datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo-GT depth generation for real data. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
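    To make the prompt fusion concrete, below is a minimal PyTorch sketch of injecting the LiDAR prompt at one decoder scale. This is an illustration, not the released architecture: the layer choices, channel sizes, and zero initialization are our assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PromptFusionBlock(nn.Module):
        """Fuses a low-res metric LiDAR depth into one decoder scale.

        Sketch only: the paper integrates the LiDAR prompt at multiple
        scales of the depth decoder; the exact layers here are assumptions.
        """
        def __init__(self, feat_channels: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Conv2d(1, feat_channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            )
            # Zero-init the last conv so fusion starts as an identity
            # on top of the pretrained decoder features.
            nn.init.zeros_(self.proj[-1].weight)
            nn.init.zeros_(self.proj[-1].bias)

        def forward(self, feat, lidar_depth):
            # Resize the metric prompt to this scale, then add a residual.
            prompt = F.interpolate(lidar_depth, size=feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
            return feat + self.proj(prompt)

    # Fuse a coarse LiDAR map into one decoder feature map.
    feat = torch.randn(1, 256, 96, 128)       # decoder features at one scale
    lidar = torch.rand(1, 1, 24, 24) * 5.0    # low-res metric depth (meters)
    fused = PromptFusionBlock(256)(feat, lidar)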


    Comparisons with Monocular Depth Methods


    Monocular depth methods can generate high-resolution depth maps, but they struggle to produce depth with a consistent metric scale, even after alignment with LiDAR.
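    For reference, "aligning with LiDAR" here means fitting a single per-image scale and shift between the monocular prediction and the LiDAR depth by least squares, as sketched below; one global transform cannot fix spatially varying scale errors, which is why the alignment still falls short.

    import numpy as np

    def align_scale_shift(mono, lidar, mask):
        """Fit lidar ~= s * mono + t on valid LiDAR pixels (least squares)."""
        x, y = mono[mask], lidar[mask]
        A = np.stack([x, np.ones_like(x)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
        return s * mono + t

    mono = np.random.rand(192, 256).astype(np.float32)           # relative depth
    lidar = 2.0 * mono + 0.5 + 0.05 * np.random.randn(192, 256)  # metric, noisy
    mask = np.random.rand(192, 256) < 0.1                        # sparse valid pixels
    aligned = align_scale_shift(mono, lidar, mask)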

    Comparisons with ARKit LiDAR Depth


    A dense LiDAR is accurate but expensive. A low-cost LiDAR is preferable, but its depth is low-resolution and noisy due to its limited power.

    ARKit LiDAR depth is a low-resolution depth map generated by the ARKit API from the 24×24-point iPhone LiDAR and RGB images.
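    The LiDAR simulation in our synthetic data pipeline can be sketched as downsampling ground-truth depth to a coarse grid, perturbing it with noise, and upsampling it back as the prompt. The grid size and noise model below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def simulate_lidar(gt_depth, grid=(24, 24), noise_std=0.02):
        """Sketch: simulate a low-res, noisy LiDAR prompt from GT depth.

        gt_depth: (1, 1, H, W) metric depth in meters.
        """
        coarse = F.interpolate(gt_depth, size=grid, mode="nearest")   # coarse sampling
        coarse = coarse * (1 + noise_std * torch.randn_like(coarse))  # range noise
        return F.interpolate(coarse, size=gt_depth.shape[-2:],
                             mode="bilinear", align_corners=False)

    gt = torch.rand(1, 1, 756, 1008) * 4.0 + 0.5  # stand-in synthetic GT depth
    prompt = simulate_lidar(gt)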


    More Testing Results



    Application: Street Reconstruction


    The iPhone LiDAR prompt can be replaced with a vehicle LiDAR, enabling high-precision depth estimation in street scenes.
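    Forming such a prompt amounts to projecting the LiDAR point cloud into the camera to obtain a sparse metric depth map. Below is a minimal pinhole-projection sketch; the intrinsics and point cloud are placeholders.

    import numpy as np

    def lidar_to_depth(points_cam, K, hw):
        """Project LiDAR points (Nx3, camera frame) into a sparse depth map."""
        H, W = hw
        pts = points_cam[points_cam[:, 2] > 0.1]  # keep points in front of camera
        uvw = (K @ pts.T).T                       # pinhole projection
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u, v, d = u[inside], v[inside], pts[inside, 2]
        depth = np.zeros((H, W), dtype=np.float32)
        order = np.argsort(-d)                    # far first, so near overwrites
        depth[v[order], u[order]] = d[order]
        return depth

    K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # placeholder intrinsics
    pts = np.random.randn(5000, 3) * [10.0, 2.0, 5.0] + [0.0, 0.0, 20.0]
    sparse_prompt = lidar_to_depth(pts, K, (720, 1280))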

    Application: Generalized Robotic Grasping


    Although the grasping policy is trained only on diffuse objects, our depth enables grasping of transparent and specular objects, outperforming policies that use RGB or raw LiDAR depth.

    This demo is implemented with our ViT-Small model, which runs at over 94 FPS on a single RTX 4090 GPU. The video is sped up 2x for demonstration.
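    A throughput number like this is typically measured with a warmed-up, GPU-synchronized timing loop, sketched below with a stand-in module in place of the released ViT-Small model.

    import time
    import torch

    @torch.no_grad()
    def benchmark(model, inputs, warmup=10, iters=100):
        """Measure FPS with proper CUDA synchronization."""
        for _ in range(warmup):
            model(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(*inputs)
        torch.cuda.synchronize()
        return iters / (time.perf_counter() - start)

    # Stand-in module; substitute the released ViT-Small depth model.
    model = torch.nn.Conv2d(3, 1, 3, padding=1).cuda().eval()
    image = torch.rand(1, 3, 756, 1008, device="cuda")
    print(f"{benchmark(model, (image,)):.1f} FPS")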


    Acknowledgements


    We thank Prof. Weinan Zhang for generous support of the robot experiments, including the space, the objects, and the Unitree H1 robot. We also thank Zhengbang Zhu, Jiahang Cao, Xinyao Li, and Wentao Dong for their help in setting up the robot platform and collecting robot data.

    Citation


    @article{lin2024promptda,
      title={Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation},
      author={Lin, Haotong and Peng, Sida and Chen, Jingxiao and Peng, Songyou and Sun, Jiaming and Liu, Minghuan and Bao, Hujun and Feng, Jiashi and Zhou, Xiaowei and Kang, Bingyi},
      journal={arXiv},
      year={2024}
    }