3D-LFM: Lifting Foundation Model

Carnegie Mellon University, The University of Adelaide

CVPR, 2024

3D-LFM, a universal 2D-3D lifting model, processes diverse objects without category-specific knowledge. It uses transformers' permutation equivariance and geometric consistency to handle camera rotations, standardizing shape representation in a canonical frame.
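As a minimal illustration of the permutation equivariance mentioned above, the NumPy sketch below checks that a single-head self-attention layer with no index-based positional encoding returns the same outputs, merely reordered, when its input tokens are reordered. The function name, weight names, and sizes are illustrative assumptions and are not taken from the 3D-LFM implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a set of tokens x with shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_points, d_model = 6, 8  # hypothetical number of landmarks and feature size
tokens = rng.normal(size=(n_points, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Shuffle the landmark order and run the same layer again.
perm = rng.permutation(n_points)
out = self_attention(tokens, w_q, w_k, w_v)
out_perm = self_attention(tokens[perm], w_q, w_k, w_v)

# Equivariance: permuting the inputs permutes the outputs in the same way.
is_equivariant = np.allclose(out_perm, out[perm])
print(is_equivariant)
```

Because the layer treats its input as an unordered set, no fixed landmark ordering or cross-category correspondence is needed for the check to pass.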

Unified 3D-LFM Model

A single model for 30+ object categories.

The 3D-LFM scales to multiple categories (30+ in our experiments), managing diverse landmark configurations through proposed architectural changes. See our paper for more details. Key: Red for ground truth, blue for predictions.
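One common way to batch objects with different landmark counts is zero-padding combined with an attention mask. The sketch below shows that general technique only; the helper names and the landmark counts are hypothetical, and whether this matches the architectural changes described in the paper is an assumption.

```python
import numpy as np

def pad_keypoints(instances, max_points):
    """Zero-pad a list of (n_i, 2) keypoint arrays to (batch, max_points, 2).

    Also returns a boolean mask that is True for real landmarks and False for padding.
    """
    batch = np.zeros((len(instances), max_points, 2))
    mask = np.zeros((len(instances), max_points), dtype=bool)
    for i, kp in enumerate(instances):
        batch[i, : len(kp)] = kp
        mask[i, : len(kp)] = True
    return batch, mask

def masked_softmax(scores, mask):
    """Softmax over the last axis that assigns zero weight to padded keys."""
    scores = np.where(mask[:, None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
object_a = rng.normal(size=(17, 2))  # hypothetical object with 17 landmarks
object_b = rng.normal(size=(10, 2))  # hypothetical object with 10 landmarks
batch, mask = pad_keypoints([object_a, object_b], max_points=17)

# Raw pairwise similarity scores, shape (2, 17, 17), then masked attention weights.
scores = batch @ batch.transpose(0, 2, 1)
weights = masked_softmax(scores, mask)
```

With the mask in place, padded slots receive no attention, so a single batch can mix categories with different numbers of landmarks.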

   

OOD Generalization

The 3D-LFM is capable of recognizing and reconstructing objects in configurations, and with numbers of landmarks, never encountered during training. The object categories shown above were never seen by the model during training.

Abstract

The lifting of 3D structure and camera from 2D landmarks is a cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data, which significantly limits their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstand occlusions, and generalize to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.

Architectural Overview
Architectural diagram of the 3D-LFM model: the system encodes 2D keypoints via TPE, processes them through Transformer layers, and decodes them to a canonical 3D shape using an MLP. The predicted shape is then aligned to the ground truth with a Procrustes alignment. The design captures both local and global context while remaining computationally efficient.
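The alignment step in the caption can be sketched with the classical orthogonal Procrustes solution based on the SVD. This is a minimal sketch that assumes a rigid alignment (rotation plus translation, no scaling); the function name and the synthetic data are hypothetical, and the exact Procrustes variant used by 3D-LFM is not specified here.

```python
import numpy as np

def procrustes_align(pred, target):
    """Rigidly align pred (n, 3) to target (n, 3) using rotation and translation."""
    mu_p, mu_t = pred.mean(axis=0), target.mean(axis=0)
    p, t = pred - mu_p, target - mu_t
    # The SVD of the cross-covariance gives the least-squares optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ t)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    return p @ rot + mu_t, rot

rng = np.random.default_rng(0)
canonical = rng.normal(size=(10, 3))  # stand-in for a predicted canonical shape

# Build a synthetic ground truth by rotating and translating the canonical shape.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1  # make q a proper rotation
ground_truth = canonical @ q + np.array([0.5, -1.0, 2.0])

aligned, rot = procrustes_align(canonical, ground_truth)
max_error = np.abs(aligned - ground_truth).max()
```

Predicting in a canonical frame and recovering the rotation in closed form means the network itself does not have to model camera rotation.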