3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive process of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pre-training tasks. Moreover, we propose a synthetic-to-real domain adaptation method in the downstream task fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.
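The abstract describes descriptions paired with each synthetic scene at three granularities (object, view, and room). As a rough illustration only, the sketch below shows one plausible way such multi-grained scene-text records could be organized in Python; the class and field names are hypothetical and are not taken from the SynVL3D release.

# Hypothetical sketch (not the authors' released code): one possible layout for a
# SynVL3D-style sample with object-, view-, and room-level text annotations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectAnnotation:
    object_id: int
    category: str              # e.g. "armchair"
    bbox: List[float]          # axis-aligned 3D box: [cx, cy, cz, dx, dy, dz]
    description: str           # object-level caption

@dataclass
class SceneSample:
    scene_id: str
    objects: List[ObjectAnnotation] = field(default_factory=list)
    view_descriptions: List[str] = field(default_factory=list)   # per-view captions
    room_description: str = ""                                    # room-level summary

# Example usage with made-up content
sample = SceneSample(
    scene_id="synvl3d_bedroom_0001",
    objects=[ObjectAnnotation(0, "bed", [1.2, 0.8, 0.5, 2.0, 1.6, 0.6],
                              "a double bed with a gray blanket near the window")],
    view_descriptions=["a bedroom seen from the doorway, with a bed on the left"],
    room_description="a small bedroom containing a bed, a nightstand, and a wardrobe",
)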
@inproceedings{3DSyn,
title = {3D Vision and Language Pretraining with Large-Scale Synthetic Data},
author = {Dejie Yang and Zhu Xu and Wentao Mo and Qingchao Chen and Siyuan Huang and Yang Liu},
booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
year = {2024},
}