OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution

Chong Xia, Fangfu Liu, Yule Wang, Yize Pang, Yueqi Duan
Tsinghua University
CVPR Findings 2026 🔥
Teaser

OnlineX: To enable online 3D reconstruction from streaming images, we propose a framework that jointly models visual appearance and language fields through a novel decoupled memory state evolution.

Abstract

Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm and lack the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in the online setting is cumulative drift, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we introduce a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then cohesively fuses information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showing robust performance across input sequences of varying lengths at real-time inference speed.

Method Overview

Overall architecture of OnlineX. Our framework features a two-stage, active-to-stable pipeline. First, the Relative Geometry Extractor processes consecutive frames to capture high-fidelity active relative information. The Anchor State Director then uses this local information to recurrently update its stable global state, yielding a globally consistent representation for the final output. The diagram illustrates this process for a single time step, which is repeated sequentially for each frame in the input stream. Dashed lines represent information passed from the previous time step or carried over to the next.
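The per-frame control flow described above can be sketched as the following minimal, heavily simplified loop. This is an illustrative assumption, not the paper's implementation: the function names mirror the module names in the overview, but the bodies are hypothetical stand-ins (a frame difference for the active local feature, an exponential moving average for the conservative stable-state update) rather than the learned networks the paper uses.

```python
import numpy as np

def relative_geometry_extractor(prev_frame, curr_frame):
    # Hypothetical stand-in for the learned extractor: derive an "active"
    # high-frequency local feature from two consecutive frames.
    return curr_frame - prev_frame

def anchor_state_director(stable_state, active_state, alpha=0.1):
    # Hypothetical stand-in for the learned director: conservatively fuse
    # the active local information into the persistent stable state
    # (here, a simple exponential moving average).
    return (1.0 - alpha) * stable_state + alpha * active_state

def run_stream(frames):
    # Recurrent active-to-stable evolution over a stream of frames.
    stable_state = np.zeros_like(frames[0])  # persistent global state
    prev_frame = frames[0]
    outputs = []
    for curr_frame in frames[1:]:
        active_state = relative_geometry_extractor(prev_frame, curr_frame)
        stable_state = anchor_state_director(stable_state, active_state)
        outputs.append(stable_state.copy())  # per-step global representation
        prev_frame = curr_frame              # carried over to the next step
    return outputs
```

The point of the sketch is the data flow: the active state is recomputed fresh at every step (so it can track local geometry), while the stable state is only ever updated incrementally (so accumulated global structure is never overwritten wholesale), which is the separation the paper credits with avoiding cumulative drift.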

More Results

BibTeX

@misc{xia2026onlinexunifiedonline3d,
  title={OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution},
  author={Chong Xia and Fangfu Liu and Yule Wang and Yize Pang and Yueqi Duan},
  year={2026},
  eprint={2603.02134},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.02134},
}