The Chronological Ascent to a Spatially Grounded World Model: Merging Geometric Architectures with Large Language Models
Charlotte Sennersten, Department of Computer Science, Kristianstad University (Sweden)
Kamilla Klonowska, Department of Computer Science, Kristianstad University (Sweden)
Abstract
The ambition to create a holistic computational model capable of comprehending and interacting with the 3D physical world, a true World Model, requires the convergence of geometric data processing and advanced linguistic reasoning. This article charts, in chronological order, the foundational scientific and engineering contributions that culminate in the proposed 3D-LLM architecture. The path starts with the conceptualization of volumetric spatial indexing in 2016 (VoxelNET) and the first end-to-end deep learning approach for 3D feature extraction from point clouds in 2017 (VoxelNet). It then progresses through the optimization of sparse perception via VoxelNeXt in 2023 and, in parallel, the development of specialized knowledge representation using the Galactica LLM in 2022. Critically, the discussion around Digital Twins (DTs) in 2023/2024 provided the framework needed to apply mathematical rigor to these personalized, complex systems. The ultimate proposed contribution, 3D-LLMs (2025), synthesizes these pillars by using dedicated localization tokens to enable natural language querying and reasoning over x, y, z space. This evolutionary path shows how discrete innovations in spatial indexing, perceptual efficiency, structural tokenization, and mathematical grounding combine into a spatially aware World Model.
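Two of the ideas named in the abstract, volumetric spatial indexing and localization tokens, can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited systems: the function names `voxelize` and `location_tokens`, the token format `<x_6>`, the coordinate range, and the number of bins are all assumptions made for illustration.

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel containing them.

    Only occupied voxels are stored, so the grid stays sparse.
    """
    grid = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key].append((x, y, z))
    return dict(grid)


def location_tokens(point, lo, hi, bins):
    """Quantize each coordinate into one of `bins` discrete tokens.

    The token spelling is hypothetical; it only shows how a continuous
    x, y, z position could be turned into vocabulary items for a language model.
    """
    tokens = []
    for axis, value in zip("xyz", point):
        clamped = min(max(value, lo), hi)
        index = min(int((clamped - lo) / (hi - lo) * bins), bins - 1)
        tokens.append(f"<{axis}_{index}>")
    return tokens


# Three sample points; the first two fall into the same voxel.
points = [(0.2, 0.1, 0.0), (0.4, 0.3, 0.1), (1.7, 0.2, 0.9)]
grid = voxelize(points, voxel_size=0.5)

# A spatial question phrased with location tokens instead of raw numbers.
query_point = (1.7, 0.2, 0.9)
prompt = "What is near " + "".join(location_tokens(query_point, 0.0, 2.0, 8)) + "?"
print(sorted(grid))
print(prompt)
```

In this sketch the voxel key plays the role of a spatial index, while the tokens give a language model a discrete handle on the same position; a real system would choose the range and resolution to match its sensor data.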
Keywords
3D-LLM, Voxelization, Digital Twin, Spatial Grounding, World Model, Galactica, VoxelNeXt.
REFERENCES
[1] Sennersten, Charlotte, Davie, Andrew, and Lindley, Craig. (2016). VoxelNET - An Agent Based System for Spatial Data Analytics. COGNITIVE 2016, The Eighth International Conference on Advanced Cognitive Technologies and Applications, Rome, Italy.
[2] Zhou, Yin, and Tuzel, Oncel. (2017). VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. doi:10.48550/arXiv.1711.06396.
[3] Chen, Yukang, Liu, Jianhui, Zhang, Xiangyu, Qi, Xiaojuan, and Jia, Jiaya. (2023). VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. doi:10.48550/arXiv.2303.11301.
[4] Taylor, Ross, Kardas, Marcin, Cucurull, Guillem, Scialom, Thomas, Hartshorn, Anthony S., Saravia, Elvis, Poulton, Andrew, Kerkez, Viktor, and Stojnic, Robert. (2022). Galactica: A Large Language Model for Science. arXiv:2211.09085.
[5] Hong, Yining, Zhen, Haoyu, Chen, Peihao, Zheng, Shuhong, Du, Yilun, Chen, Zhenfang, and Gan, Chuang. (2023). 3D-LLM: Injecting the 3D World into Large Language Models. doi:10.48550/arXiv.2307.12981.
[6] Sennersten, Charlotte, Evans, Ben, and Lindley, Craig. (2019). VoxelNET’s Geo-Located Spatio Temporal Softbots. COGNITIVE 2019, The Eleventh International Conference on Advanced Cognitive Technologies and Applications, Venice, Italy.
[7] Antil, Harbir. (2024). Mathematical Opportunities in Digital Twins (MATH-DT). arXiv:2402.10326.
New Perspectives in Science Education