GenSplat: Bridging the Generalization Gap
in 3DGS Language Comprehension

Fang Liu*,   Yuhao Liu*,   Ke Xu,   Gerhard Petrus Hancke,   Rynson W.H. Lau

City University of Hong Kong

CVPR 2026

*Equal Contribution    Joint Corresponding Authors

Abstract

In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bound to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight is to formulate a structured learning process that progressively aligns linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that guides the model from semantic-level representations to instance-level concepts and finally free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Model (MLLM)-guided Reasoning Module that leverages the MLLM's semantic and spatial priors to enhance 3D localization and reasoning. To further improve spatial alignment and computational efficiency, we introduce a Geometry-Aware Frame Selector (GAFS), which adaptively selects the most informative views based on Gaussian and textural cues. Extensive cross-task evaluations (including 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding) demonstrate the state-of-the-art performance and strong generalization capability of GenSplat.

Method Overview

GenSplat Pipeline

Overview of the GenSplat framework. Given multi-view RGB images and a text query (e.g., for referring segmentation or VQA), GenSplat first reconstructs a 3D Gaussian representation and extracts semantic Gaussian features via the Gaussian Encoder, based on which the Instance Decoder produces instance queries for 3D segmentation. Meanwhile, a Geometry-Aware Frame Selector (GAFS) adaptively selects informative keyframes. The text and selected frames are processed by an MLLM to predict a reasoning token for referring segmentation (or to generate answer tokens for VQA). The predicted token is concatenated with the instance queries and fed into the Referring Decoder to produce the corresponding 3D Gaussian mask.
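As an illustrative sketch only (the page does not specify the GAFS scoring details), keyframe selection driven by Gaussian and textural cues could be realized as a greedy procedure: each candidate frame is scored by how many not-yet-covered Gaussians it observes plus a texture-richness term. The function name, inputs, and weighting below are all hypothetical:

```python
import numpy as np

def select_keyframes(gaussian_coverage, texture_scores, k=4, alpha=0.5):
    """Greedily pick k informative frames (hypothetical GAFS-style sketch).

    gaussian_coverage: (F, G) boolean matrix -- frame f observes Gaussian g.
    texture_scores:    (F,) per-frame texture richness (e.g. mean gradient).
    alpha: weight trading off new Gaussian coverage against texture.
    """
    F, G = gaussian_coverage.shape
    covered = np.zeros(G, dtype=bool)
    tex = texture_scores / (texture_scores.max() + 1e-8)  # normalize to [0, 1]
    selected = []
    for _ in range(min(k, F)):
        # Marginal gain: fraction of not-yet-covered Gaussians each frame adds.
        gain = (gaussian_coverage & ~covered).mean(axis=1)
        score = alpha * gain + (1 - alpha) * tex
        score[selected] = -np.inf  # never re-pick a chosen frame
        best = int(np.argmax(score))
        selected.append(best)
        covered |= gaussian_coverage[best]
    return selected
```

The greedy marginal-coverage term encourages geometrically complementary views, while the texture term favors frames that are visually informative for the MLLM; the actual GenSplat selector may differ.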

Qualitative Results

Qualitative Results

Qualitative visualization on the ScanRefer and SQA3D datasets. We show referring segmentation results on scenes (a) and (b), and additionally a question-answering example on scene (b). For complex free-form queries, GenSplat reliably locates the target object, demonstrating strong reasoning ability. In contrast, 2D-based methods (Grounded-SAM) and per-scene optimization approaches fail under these challenging scenarios.

Complex Query Results

Complex query results. GenSplat handles complex free-form language queries that require compositional reasoning and spatial understanding.

BibTeX

Coming soon...

If you have any questions, please contact us at fawnliu2333@gmail.com.