
Keep your ARMS and LEGS: Assuaging VRAM and Training Speed constraints in Language Embedded 3D Gaussian Splats
Project Overview
We present an optimization of the Foundation Model Embedded Gaussian Splatting (FMGS) pipeline for open-vocabulary 3D scene understanding. FMGS combines 3D Gaussian Splatting (3DGS) with foundation models such as CLIP and DINO, but it suffers from high VRAM usage and slow training because its multi-resolution hash encodings (MHEs) are decoded by per-Gaussian MLPs.
To address this, we propose an alternative pipeline that renders the MHE features into an image-space feature map and then applies parameter-efficient CNNs to decode semantic features. This change yields faster training, lower memory usage, and more scalable inference while preserving model accuracy.
Method

Figure 1. Our adjusted FMGS architecture.
Our method follows the FMGS training pipeline; the key change is replacing per-Gaussian MLP feature decoding with a rendered feature map decoded by CNNs:
- Render MHE feature field across the image plane
- Apply CLIP and DINO CNNs to the rendered feature maps
- Compare CNN outputs against frozen foundation model features from CLIP and DINO
- Supervise with a hybrid loss combining a CLIP loss, a DINO feature loss, and a pixel-alignment loss
This modification avoids storing per-Gaussian high-dimensional features and reduces GPU overhead, especially in large-scale scenes.
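To make the decoding path concrete, the following is a minimal PyTorch sketch of the CNN decoder and hybrid loss. The module names, channel sizes, and loss weights are illustrative assumptions rather than the exact FMGS configuration, and the pixel-alignment term is omitted for brevity.

```python
# Minimal sketch of the rendered-feature-map + CNN decoding path (PyTorch).
# Channel sizes, module names, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """Decode a rendered MHE feature map into CLIP- and DINO-space feature maps."""
    def __init__(self, in_ch=32, clip_dim=512, dino_dim=384):
        super().__init__()
        # Small convolutional heads replace the per-Gaussian MLP decoders.
        self.clip_head = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, clip_dim, kernel_size=1),
        )
        self.dino_head = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, dino_dim, kernel_size=1),
        )

    def forward(self, rendered_feats):  # rendered_feats: (B, in_ch, H, W)
        return self.clip_head(rendered_feats), self.dino_head(rendered_feats)

def hybrid_loss(clip_pred, dino_pred, clip_target, dino_target, w_clip=1.0, w_dino=1.0):
    """Cosine loss against frozen CLIP features plus L2 loss against frozen DINO features."""
    clip_loss = (1.0 - F.cosine_similarity(clip_pred, clip_target, dim=1)).mean()
    dino_loss = F.mse_loss(dino_pred, dino_target)
    return w_clip * clip_loss + w_dino * dino_loss
```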
Results
1. Accuracy on Open-Vocabulary Object Detection
We evaluate open-vocabulary object detection accuracy on five scenes from the LERF dataset; a prediction is counted as correct if the highest-scoring pixel of the relevancy map lies within the ground-truth bounding box (a sketch of this criterion follows the table below). Despite the architectural changes, our model achieves comparable or better accuracy than the original FMGS on most scenes.
Scene | LERF | FMGS | Ours |
---|---|---|---|
bouquet | 83.3% | 91.7% | 100.0% |
figurines | 87.2% | 79.5% | 79.5% |
ramen | 62.5% | 80.0% | 82.5% |
teatime | 96.9% | 87.5% | 90.6% |
kitchen | 85.2% | DNF | 72.2% |
Average | 83.0% | 84.7% | 84.96% |
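As a rough illustration of the detection criterion above, the check below assumes an (H, W) relevancy map and a ground-truth box in (x_min, y_min, x_max, y_max) pixel coordinates; it is not the exact LERF evaluation code.

```python
# Sketch of the evaluation criterion: the arg-max pixel of the relevancy map
# must lie inside the ground-truth bounding box. Array shape and box format
# are assumptions.
import numpy as np

def query_is_correct(relevancy_map: np.ndarray, gt_box) -> bool:
    """relevancy_map: (H, W) per-pixel scores; gt_box: (x_min, y_min, x_max, y_max)."""
    y, x = np.unravel_index(np.argmax(relevancy_map), relevancy_map.shape)
    x_min, y_min, x_max, y_max = gt_box
    return x_min <= x <= x_max and y_min <= y <= y_max
```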
2. Computational Efficiency
We reduce training time by 37% and VRAM usage by 24% on average compared to FMGS, enabling faster experimentation and broader accessibility on consumer GPUs.

Figure 2. Runtime and memory usage comparison across five scenes. Our model trains faster and uses less VRAM than FMGS.
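Timings and memory figures of this kind can be collected with standard PyTorch utilities; the snippet below is a generic sketch rather than our exact profiling setup, and `train_scene` is a hypothetical placeholder for a full training run.

```python
# Generic sketch for logging wall-clock training time and peak VRAM per scene.
# `train_scene` is a placeholder, not an actual function from our codebase.
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.time()
# train_scene(scene_config)
elapsed_min = (time.time() - start) / 60.0
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"training time: {elapsed_min:.1f} min | peak VRAM: {peak_gb:.2f} GB")
```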
3. Qualitative Results
Compared to FMGS, our method produces relevancy maps that are spatially smoother and exhibit fewer spurious activations in background regions. While FMGS often yields more intense responses, our approach better suppresses noise and keeps activations focused on the queried objects.

Figure 3. Relevancy maps from FMGS (left) and our method (right) for the same query.
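For context, the relevancy maps shown here score each pixel's rendered CLIP-space feature against the text query. The sketch below follows a LERF-style formulation with canonical negative phrases; the phrase set and the use of a temperature are assumptions, not the exact scoring code used in our experiments.

```python
# Sketch of a LERF-style per-pixel relevancy score from rendered CLIP-space features.
# The canonical phrase set and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def relevancy_map(pixel_feats, query_emb, canon_embs, temp=0.1):
    """
    pixel_feats: (H, W, D) rendered CLIP-space features
    query_emb:   (D,)      CLIP text embedding of the query
    canon_embs:  (K, D)    CLIP text embeddings of canonical phrases, e.g. "object", "stuff"
    Returns an (H, W) relevancy map in [0, 1].
    """
    feats = F.normalize(pixel_feats, dim=-1)
    q = F.normalize(query_emb, dim=0)
    c = F.normalize(canon_embs, dim=-1)
    sim_q = feats @ q                              # (H, W) similarity to the query
    sim_c = torch.einsum("hwd,kd->hwk", feats, c)  # (H, W, K) similarity to canonical phrases
    # Pairwise softmax of (query, canonical phrase); keep the query probability,
    # then take the minimum over canonical phrases.
    pair = torch.stack([sim_q.unsqueeze(-1).expand_as(sim_c), sim_c], dim=-1) / temp
    return torch.softmax(pair, dim=-1)[..., 0].min(dim=-1).values
```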