Summary: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering
unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of
the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as
in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their
potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our
investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their
integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based
agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and
language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches
to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands
the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a
project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.