3D Vision and Language Pretraining with Large-Scale Synthetic Data

Dejie Yang1, Zhu Xu1, Wentao Mo1, Qingchao Chen2,3, Siyuan Huang4, Yang Liu1,3*

1Wangxuan Institute of Computer Technology, Peking University
2National Institute of Health Data Science, Peking University
3National Key Laboratory of General Artificial Intelligence, Peking University
4State Key Laboratory of General Artificial Intelligence, BIGAI

IJCAI 2024

*Corresponding Author

Advantages of our proposed dataset SynVL3D.

Abstract

3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive process of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pretraining tasks. Moreover, we propose a synthetic-to-real domain adaptation in the downstream fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.
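For illustration, a single SynVL3D entry could be organized roughly as below. The field names and values are hypothetical placeholders that show the three annotation granularities (object, view, room); they are not the released data format.

# Hypothetical SynVL3D entry illustrating the three annotation granularities
# (object-, view-, and room-level text paired with one synthetic scene).
# Field names are assumptions for illustration, not the actual release schema.
sample_scene = {
    "scene_id": "synvl3d_00042",
    "point_cloud": "scenes/synvl3d_00042.ply",       # synthetic indoor scan
    "object_captions": [                              # object-level descriptions
        {"object_id": 7, "bbox": [1.2, 0.4, 0.8, 0.6, 0.6, 1.1],
         "text": "a grey armchair next to the window"},
    ],
    "view_captions": [                                # view-level descriptions
        {"camera_pose": [0.0, 0.0, 1.6, 0.0, 0.0, 0.0],
         "text": "a reading corner with an armchair and a floor lamp"},
    ],
    "room_caption": "a bright living room with a sofa, an armchair, "
                    "a coffee table, and large windows",
}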

Framework


The model architecture of our SynFormer3D. The multi-modal encoder includes a 3D-object encoder, a text encoder, and cross-modal fusion modules. Compared to previous pre-trained models, our SynFormer3D introduces more fine-grained auxiliary pre-training tasks, including Object Relation Prediction as well as Multi-level and View-aggregated Region-Word Alignment.
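To make this description concrete, below is a minimal PyTorch sketch of such a multi-modal encoder with lightweight heads for the two auxiliary objectives. It is an illustration under assumed module sizes and feature dimensions, not the authors' released implementation; names such as SynFormer3DSketch, relation_head, and align_temp are hypothetical.

import torch
import torch.nn as nn

class SynFormer3DSketch(nn.Module):
    """Minimal sketch of the described architecture; hyper-parameters are guesses."""

    def __init__(self, d_model=768, num_relations=10, vocab_size=30522):
        super().__init__()
        # 3D-object encoder: projects per-object point features into the joint space.
        self.object_encoder = nn.Sequential(
            nn.Linear(1024, d_model), nn.ReLU(), nn.LayerNorm(d_model)
        )
        # Text encoder: token embeddings + a shallow Transformer encoder.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )
        # Cross-modal fusion over the concatenated object and word sequences.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )
        # Auxiliary heads for the pre-training tasks named in the caption.
        self.relation_head = nn.Linear(2 * d_model, num_relations)  # Object Relation Prediction
        self.align_temp = nn.Parameter(torch.tensor(0.07))          # Region-Word Alignment

    def forward(self, object_feats, token_ids):
        obj = self.object_encoder(object_feats)                # (B, N_obj, D)
        txt = self.text_encoder(self.token_embed(token_ids))   # (B, N_tok, D)
        fused = self.fusion(torch.cat([obj, txt], dim=1))
        fused_obj, fused_txt = fused[:, : obj.size(1)], fused[:, obj.size(1):]
        # Region-word similarity used by the alignment objective.
        align_logits = fused_obj @ fused_txt.transpose(1, 2) / self.align_temp
        # Pairwise relation logits between the first two objects, as an example.
        pair = torch.cat([fused_obj[:, 0], fused_obj[:, 1]], dim=-1)
        relation_logits = self.relation_head(pair)
        return align_logits, relation_logits

In practice, the 3D-object encoder would operate on detected object proposals (per-box point-cloud features), and the alignment logits would feed a contrastive region-word loss at multiple text granularities.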

Results

1. Grounding accuracy (%) on Nr3D, Sr3D, and ScanRefer. The best and second-best results are in bold and underlined.


2. Dense Captioning results on the Scan2Cap dataset. “C” stands for “CIDEr”, “B-4” for “BLEU-4”, “M” for “METEOR”, and “R” for “ROUGE”, respectively. “@0.25” and “@0.5” denote metrics computed with a 3D IoU threshold of 0.25 and 0.5 between the predicted and annotated boxes. The best and second-best results are in bold and underlined.

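As a point of reference for the “@0.25”/“@0.5” notation, these metrics are commonly computed by crediting a predicted caption only when its box overlaps the annotated box with 3D IoU at least k. The sketch below shows that convention in generic form; caption_score and iou_fn are assumed callables rather than a specific evaluation library, and the exact matching protocol used in the paper may differ.

def metric_at_k_iou(predictions, ground_truths, caption_score, iou_fn, k=0.5):
    """Generic m@kIoU sketch: average caption quality over ground-truth boxes,
    zeroing predictions whose best-matching box has 3D IoU below k."""
    total = 0.0
    for gt in ground_truths:
        # Find the prediction whose box overlaps this ground-truth box the most.
        best = max(predictions, key=lambda p: iou_fn(p["bbox"], gt["bbox"]), default=None)
        if best is not None and iou_fn(best["bbox"], gt["bbox"]) >= k:
            total += caption_score(best["caption"], gt["captions"])
        # Otherwise the ground-truth object contributes a score of 0.
    return total / max(len(ground_truths), 1)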

3. Answer accuracy on ScanQA using object proposals from Mask3D. Each entry denotes “test w/ object / test w/o object”. The best and second-best results are in bold and underlined.


BibTeX

@inproceedings{3DSyn,
  title     = {3D Vision and Language Pretraining with Large-Scale Synthetic Data},
  author    = {Dejie Yang and Zhu Xu and Wentao Mo and Qingchao Chen and Siyuan Huang and Yang Liu},
  booktitle = {Proceedings of the Thirty-Third International Joint Conference on
               Artificial Intelligence, {IJCAI-24}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  year      = {2024},
}