TikTok's Depth Anything Model Sets New Standard for Robust Image-Based Depth Estimation

2024-01-24

Researchers from TikTok, the University of Hong Kong, and Zhejiang Lab have shared a new study that sets an impressive benchmark for monocular depth estimation (MDE) across a range of datasets and metrics. The study, titled "Depth Anything," marks significant progress in AI-driven depth perception, particularly in understanding and interpreting the depth of objects within a single image.

At the core of "Depth Anything" is its training data: 1.5 million labeled images and over 62 million unlabeled images. This extensive training set is crucial to the model's proficiency. Unlike traditional MDE models, which rely heavily on smaller labeled datasets, "Depth Anything" leverages a large amount of unlabeled data by automatically annotating it with a teacher model trained on the labeled set. This approach significantly expands data coverage and thereby reduces generalization error, a common challenge for AI models.

A New Benchmark in Depth Estimation

What sets "Depth Anything" apart? First, it outperforms MiDaS v3.1 (BEiTL-512) at zero-shot relative depth estimation, meaning it can estimate depth in unseen images without any additional training. It also surpasses ZoeDepth at zero-shot metric depth estimation, indicating its ability to assess the actual distance between objects and the camera. When fine-tuned on NYUv2 and KITTI metric depth data, "Depth Anything" also achieves new state-of-the-art results on these widely used benchmarks. This combination of strong generalization and fine-tuning performance makes "Depth Anything" a new cornerstone for monocular depth estimation research.
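For readers who want to try zero-shot relative depth estimation themselves, the sketch below shows one plausible way to run a released Depth Anything checkpoint through Hugging Face's depth-estimation pipeline. The pipeline task exists in the transformers library, but the specific checkpoint identifier ("LiheYoung/depth-anything-large-hf") and its availability are assumptions here, not details stated in the study.

```python
# A minimal sketch of zero-shot relative depth estimation, assuming a
# Depth Anything checkpoint is published on the Hugging Face Hub under an
# identifier like "LiheYoung/depth-anything-large-hf" (assumed, not from the study).
import requests
from PIL import Image
from transformers import pipeline

# Load the depth-estimation pipeline; no extra fine-tuning is needed for
# relative depth on unseen images (the "zero-shot" setting described above).
depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-large-hf",  # assumed checkpoint id
)

# Any RGB image works; this public COCO image is just a convenient example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth_estimator(image)
# "depth" is a PIL image of the per-pixel relative depth map;
# "predicted_depth" is the raw tensor output of the model.
result["depth"].save("relative_depth.png")
print(result["predicted_depth"].shape)
```

Because the output is relative rather than metric depth, the values rank pixels by distance but do not carry units; recovering real-world distances corresponds to the separate zero-shot metric setting the study evaluates against ZoeDepth.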