Abstract

Splatter Image, a recent approach for monocular 3D object reconstruction, achieves high efficiency using Gaussian splatting while maintaining state-of-the-art performance. In this work, we propose three enhancements to this approach:

  1. Incorporating semantic embeddings from pre-trained vision-language models to provide richer contextual understanding
  2. Integrating monocular depth estimation to improve geometric accuracy
  3. Augmenting the training loss with Total Variation and Edge terms to refine reconstruction details

Our experiments show that semantic conditioning, particularly using DINO embeddings, significantly improves view consistency and generalization. Depth information further enhances reconstruction quality by constraining the solution space, whereas the loss modifications do not bring substantial improvements.
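
For readers unfamiliar with the Total Variation and Edge losses named in the third contribution, the following is a minimal NumPy sketch of one common formulation of each. It is an illustration only and is not taken from the repository: the function names `total_variation_loss` and `edge_loss`, the (H, W, C) image layout, the use of simple finite differences as the edge operator, and the unweighted sum of the two directions are all assumptions.

```python
import numpy as np


def total_variation_loss(img: np.ndarray) -> float:
    """Anisotropic total variation of an (H, W, C) image.

    Averages the absolute difference between vertically and
    horizontally adjacent pixels; smoother images score lower.
    """
    dh = np.abs(img[1:, :, :] - img[:-1, :, :])  # vertical neighbours
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :])  # horizontal neighbours
    return float(dh.mean() + dw.mean())


def edge_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """L1 distance between finite-difference gradients of two images.

    Assumption: forward differences stand in for the edge operator;
    a Sobel filter is another common choice.
    """
    def grads(x: np.ndarray):
        return x[1:, :, :] - x[:-1, :, :], x[:, 1:, :] - x[:, :-1, :]

    pred_gy, pred_gx = grads(pred)
    tgt_gy, tgt_gx = grads(target)
    return float(np.abs(pred_gy - tgt_gy).mean() + np.abs(pred_gx - tgt_gx).mean())
```

In a training loop, terms like these would typically be scaled by small weights and added to the main reconstruction loss; the weights themselves are hyperparameters not specified here.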

Code is available at https://github.com/splatter-works/splatter-image.

More details can be found in our full report: