3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important technique for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily due to the labor-intensive process of collecting and annotating 3D scenes. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels, which offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Utilizing the rich annotations in SynVL3D, we pre-train a simple and unified Transformer for aligning 3D and language with multi-grained pre-training tasks. Moreover, we propose a synthetic-to-real domain adaptation method in the downstream task fine-tuning process to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.
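The abstract describes descriptions paired with each synthetic scene at three granularities (object, view, and room). As a rough illustration only, the sketch below shows one plausible way such multi-grained scene-text records could be organized in Python; the class and field names are hypothetical and are not taken from the SynVL3D release.

# Hypothetical sketch (not the authors' released code): one possible layout for a
# SynVL3D-style sample with object-, view-, and room-level text annotations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectAnnotation:
    object_id: int
    category: str              # e.g. "armchair"
    bbox: List[float]          # axis-aligned 3D box: [cx, cy, cz, dx, dy, dz]
    description: str           # object-level caption

@dataclass
class SceneSample:
    scene_id: str
    objects: List[ObjectAnnotation] = field(default_factory=list)
    view_descriptions: List[str] = field(default_factory=list)   # per-view captions
    room_description: str = ""                                    # room-level summary

# Example usage with made-up content
sample = SceneSample(
    scene_id="synvl3d_bedroom_0001",
    objects=[ObjectAnnotation(0, "bed", [1.2, 0.8, 0.5, 2.0, 1.6, 0.6],
                              "a double bed with a gray blanket near the window")],
    view_descriptions=["a bedroom seen from the doorway, with a bed on the left"],
    room_description="a small bedroom containing a bed, a nightstand, and a wardrobe",
)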
@inproceedings{3DSyn,
title = {3D Vision and Language Pretraining with Large-Scale Synthetic Data},
author = {Dejie Yang and Zhu Xu and Wentao Mo and Qingchao Chen and Siyuan Huang and Yang Liu},
booktitle = {Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
year = {2024},
}