Real-time multimodal 3D reconstruction with tactile-enhanced Gaussian splatting.
GaussianFeels is an online visuo-tactile reconstruction and tracking system built around an explicit object-centric 3D Gaussian map — updated under hand-induced occlusion, tracked when pose supervision is removed, and exported to manipulation from frame zero.
Why one camera isn't enough
The geometry that matters during in-hand manipulation is the geometry the camera can't see.
During grasping and reorientation, the surfaces a robot needs to reason about are exactly the ones occluded by the hand and fingertips. Tactile sensing measures those regions — but only over a small contact patch. A practical manipulation system needs both.
RGB-D camera · Dense observations of exposed surfaces. Fails behind the hand and the fingertips during contact.
Tactile sensing · Direct contact geometry at the manipulation interface. Tiny footprint — only a few square centimeters.
Proprioception · Forward-kinematic grasp center, finger poses, contact constraints. Used to seed and bound estimates.
An explicit object-centric Gaussian map can serve as the shared state for contact-rich manipulation: updated online from synchronized RGB-D, tactile, and proprioceptive observations — tracked directly when object-pose supervision is removed — and exposed immediately to downstream manipulation modules as a progressively improving object model.
Three problems, one representation
The thesis argues that prior systems pick a representation that solves one of these roles and forces a re-encoding for the other two. Implicit fields render and supervise well but don't export cleanly to a policy. Mesh-only pipelines export but can't be updated under contact. Point clouds update easily but render poorly. An explicit Gaussian state covers all three roles — and lets pose tracking sit in the middle of the loop.
Mapping · Sparse Gaussian updates with contact-aware population management and an active-budget cap so the runtime stays online.
Tracking · The frozen-map signed-distance field exposes a differentiable residual the Theseus optimizer can converge on under heavy occlusion.
Export · PLY exports and provenance-labeled point clouds hand off to a policy — measured surfaces flagged separately from generated ones.
Object-centric Gaussian map · pose tracker · occlusion-aware loss
Four parts, all reconstructing the same object frame.
Sensor ingestion synchronizes RGB-D, segmentation, tactile contacts, and hand state. The Gaussian map lives in object coordinates and is synchronized to world only when rendering or supervision needs it. The pose tracker sits in the middle of the loop — it's the transform that links observations to the map.
2.1 · Frozen-map signed distance
Tracking samples object pixels from the current depth image, transforms those points into the world frame, and minimizes a signed-distance residual against a frozen anchor cloud (qᵢ, nᵢ) sampled from recent keyframes. Weights decay with distance in Gaussian fashion; the result is a smooth analogue of a point-cloud SDF.
q̃(p) = Σᵢ wᵢ qᵢ / Σᵢ wᵢ   (4.5)
ñ(p) = Σᵢ wᵢ nᵢ / ‖Σᵢ wᵢ nᵢ‖   (4.6)
dM(p) = (p − q̃(p))ᵀ ñ(p)   (4.7)
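The weighted-anchor construction in (4.5)–(4.7) can be sketched in plain NumPy. The function name and the Gaussian falloff scale `sigma` are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def frozen_map_sdf(p, anchors_q, anchors_n, sigma=0.01):
    """Smooth point-to-surface signed distance against a frozen anchor cloud.

    p         : (3,) query point in world frame
    anchors_q : (N, 3) anchor positions q_i sampled from recent keyframes
    anchors_n : (N, 3) unit anchor normals n_i
    sigma     : Gaussian falloff scale in meters (illustrative value)
    """
    d2 = np.sum((anchors_q - p) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian-like weights
    q_tilde = (w[:, None] * anchors_q).sum(0) / w.sum()  # weighted anchor (4.5)
    n_tilde = (w[:, None] * anchors_n).sum(0)
    n_tilde /= np.linalg.norm(n_tilde)                   # renormalized normal (4.6)
    return float((p - q_tilde) @ n_tilde)                # signed distance (4.7)
```

For a planar anchor patch the function returns the signed height above the plane, which matches the plane's analytic SDF; the smooth weights are what make the residual differentiable for the tracker.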
2.2 · Pose objective
For frames after the seed, the runtime samples object pixels, lifts them, and optimizes only the latest pose in a sliding window. The objective combines a frozen-map residual on camera and tactile points with temporal and ICP priors:
E(Rt, tt) = Σₚ dM(Rt p + tt)²   (sum over sampled camera and tactile points)
  + λtr ‖tt − tt−1‖² + λrot ‖log(Rt−1ᵀ Rt)‖²
  + λicp,t ‖tt − t̂icp‖² + λicp,r ‖log(R̂icpᵀ Rt)‖²   (4.8)
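As a rough NumPy stand-in for the Theseus objective (4.8): `so3_log`, `pose_cost`, and the default weights are illustrative names and values, not the thesis code, and `sdf` is any callable matching the frozen-map residual of Section 2.1.

```python
import numpy as np

def so3_log(R):
    """Rotation-matrix log map, returned as an axis-angle 3-vector."""
    cos_t = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-8:
        return np.zeros(3)
    return (theta / (2.0 * np.sin(theta))) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def pose_cost(R_t, t_t, pts_obj, sdf, R_prev, t_prev, R_icp, t_icp,
              lam_tr=1.0, lam_rot=1.0, lam_icp_t=1.0, lam_icp_r=1.0):
    """Evaluate (4.8): frozen-map data term plus temporal and ICP priors."""
    data = sum(sdf(R_t @ p + t_t) ** 2 for p in pts_obj)
    temporal = (lam_tr * np.sum((t_t - t_prev) ** 2)
                + lam_rot * np.sum(so3_log(R_prev.T @ R_t) ** 2))
    icp = (lam_icp_t * np.sum((t_t - t_icp) ** 2)
           + lam_icp_r * np.sum(so3_log(R_icp.T @ R_t) ** 2))
    return data + temporal + icp
```

Theseus would minimize this over (Rt, tt) with analytic Jacobians; the sketch only evaluates the scalar cost so each term can be checked against the equation.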
2.3 · Occlusion-aware reweighting
RGB and depth residuals become misleading when the hand fills a large fraction of the object view. The trainer monitors an occlusion ratio ρocc from a dilated foreground-and-edge mask and reweights the four core supervision channels online:
λ′rgb = svis λrgb ,  λ′depth = svis λdepth   (4.15, 4.16)
λ′tactile = (1 + β ρocc) λtactile   (4.17)
Here svis is the visibility weight, floored at wmin as occlusion grows.
With wmin = 0.2 and β = 2, tactile supervision overtakes the unoccluded visual baseline at ρocc ≈ 0.4 — the point at which the hand starts dominating the camera view.
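A minimal sketch of the reweighting rule, assuming svis = max(wmin, 1 − ρocc); that exact visibility score is inferred from the surrounding text rather than quoted from the thesis:

```python
def reweight(lam_rgb, lam_depth, lam_tactile, rho_occ, w_min=0.2, beta=2.0):
    """Occlusion-aware reweighting of the supervision channels.

    Assumes s_vis = max(w_min, 1 - rho_occ); the thesis may define the
    visibility score differently.
    """
    s_vis = max(w_min, 1.0 - rho_occ)
    return {
        "rgb": s_vis * lam_rgb,                            # (4.15)
        "depth": s_vis * lam_depth,                        # (4.16)
        "tactile": (1.0 + beta * rho_occ) * lam_tactile,   # (4.17)
    }
```

With these defaults, visual weights bottom out at 0.2× once the hand dominates the view, while the tactile weight grows linearly in ρocc.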
2.4 · Manipulation-side completion
A second process bootstraps a frame-0 estimate from a single RGBA crop using Hunyuan3D-2-mini, an orientation-variant search, and a registration stage that solves the model-to-object transform. As the SLAM loop accumulates measurements, generated geometry is progressively replaced by measured geometry — with provenance preserved so the policy knows which surfaces are real.
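The progressive-replacement step might look like the following sketch, where the replacement radius and the provenance encoding (0 = generated, 1 = measured) are assumptions for illustration, not the pipeline's actual rule:

```python
import numpy as np

def merge_measured(cloud_xyz, provenance, measured_xyz, radius=0.005):
    """Replace generated points near new measurements; keep provenance labels.

    cloud_xyz    : (N, 3) current object cloud
    provenance   : (N,) int labels, 0 = generated, 1 = measured (assumed scheme)
    measured_xyz : (M, 3) new measured surface points
    radius       : replacement radius in meters (illustrative value)
    """
    keep = np.ones(len(cloud_xyz), dtype=bool)
    for i, (p, lab) in enumerate(zip(cloud_xyz, provenance)):
        # drop a generated point once a real measurement lands nearby
        if lab == 0 and np.min(np.linalg.norm(measured_xyz - p, axis=1)) < radius:
            keep[i] = False
    new_xyz = np.vstack([cloud_xyz[keep], measured_xyz])
    new_prov = np.concatenate([provenance[keep],
                               np.ones(len(measured_xyz), dtype=int)])
    return new_xyz, new_prov
```

The key invariant is that labels travel with the points, so a downstream policy can always ask which fraction of the surface is measured rather than hallucinated.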
Per-frame runtime
Two processes, one shared object frame.
SLAM runs on one GPU, manipulation-side completion on another. The frame-0 payload bootstraps a Hunyuan3D-2-mini prior; later frames progressively replace generated geometry with measured geometry as the episode unfolds.
Figure. Two-process runtime. Solid arrows are within-process; dashed arrows are inter-process queue messages. The pose mode strip across the bottom describes which channel feeds TtWO — the rest of the SLAM loop is identical across modes.
Live reconstructions · animated
What the system actually does, frame by frame.
Each tile below is an interactive simulation of a different stage of the runtime — drag the timeline to scrub. Real video captures from FeelSight-Sim, FeelSight-Real, and FeelSight-Occlusion will replace these tiles after the final experiment sweep (see status section).
Note. These visualizations recreate the qualitative behavior described in Chapters 4–5 of the thesis. Final videos — rendered from saved PLY exports of the 40-trial benchmark — will be linked here once the run completes.
Reconstruction quality · pose tracking · shape priors
How the system holds up under the saved 40-trial sweep.
Three benchmark groups: (a) reconstruction quality on FeelSight-{Sim, Real, Occlusion}, (b) pose-tracking stability under three pose modes, and (c) frame-0 image-to-3D priors evaluated by aligned F-score at 5 mm.
5.1 · Reconstruction quality (FeelSight family)
F@5 mm ↑ · ADD-S (mm) ↓ · runtime FPS ↑ · fraction of trials reaching ≥ 50% measured surface ↑
| Variant | Pose mode | F@5 mm ↑ | ADD-S (mm) ↓ | FPS ↑ | Measured ≥50% ↑ |
|---|---|---|---|---|---|
| FeelSight-Sim | map | — pending — | — | — | — |
| FeelSight-Sim | true_slam | — | — | — | — |
| FeelSight-Real | slam | — | — | — | — |
| FeelSight-Real | true_slam | — | — | — | — |
| FeelSight-Occlusion | slam | — | — | — | — |
| FeelSight-Occlusion | true_slam | — | — | — | — |
Table 6.1. Reserved table for the final SLAM sweep. Numbers will be filled in from the 40-trial Optuna baseline (see Appendix A.1 of the thesis).
5.2 · Frame-0 shape-prior comparison · 40 trials · F@5 mm
Below is the partial benchmark already tabulated in the thesis manuscript. Hunyuan3D-2-mini is the deployed prior for the manipulation branch; the others are baselines.
| Method | F@5 mm (mean) | F@5 mm (median) | Wins on cuboid | Wins on articulated | Used in pipeline |
|---|---|---|---|---|---|
| Hunyuan3D-2-mini (deployed) | 0.689 | 0.720 | 24/40 | 31/40 | ✓ frame-0 prior |
| InstantMesh | 0.639 | 0.651 | 29/40 | 8/40 | baseline |
| Fast-SAM3D | 0.633 | 0.648 | — | — | baseline |
| TripoSR | 0.576 | 0.590 | — (high F, blob) | — | baseline |
| RGB2Point | 0.484 | 0.495 | — | — | baseline |
| Geco | 0.400 | 0.411 | — | — (collapses) | baseline |
Table 6.3. The benchmark cannot be interpreted from F-score alone — TripoSR can outscore Hunyuan3D-2-mini on a structured cuboid trial while still rounding the object into a generic blob. Qualitative win-rate on category-faithful shapes is reported alongside F-score.
5.3 · Effect of frame-0 prior on downstream manipulation
The manipulation branch hands a provenance-labeled object cloud to a downstream policy from t = 0. The benchmark evaluates whether the early prior helps or hurts policy success when measured geometry is still sparse.
Planned metric cards:
- frame-0 prior on vs. off, FeelSight-Real
- policy success once measured fraction ≥ 80%
- time to first stable grasp
- agreement between generated and measured surfaces
Table 6.5 (reserved). Cards are placeholders — they will be replaced by the final manipulation-prior ablation once policy rollouts are available.
Where the system breaks — and why
Honest failures from pilot runs.
Failure cases matter as much as benchmarks. The thesis catalogues three recurring modes; reproducing the qualitative behavior here lets future work target the right thing.
F1 · Thin / wire-like objects
map population · Anisotropic Gaussians can't shrink below the contact-spawn scale floor; wire-like geometry gets covered by splats biased outside the GT line. Tactile contacts help locally but can't carry the rest of the object.
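The scale floor behind this failure can be illustrated with a one-line clamp; the 3 mm value is a made-up stand-in for the actual hyperparameter:

```python
import numpy as np

def clamp_spawn_scales(scales, scale_floor=0.003):
    """Per-axis scale floor applied at contact spawn.

    scales      : (..., 3) Gaussian scales in meters
    scale_floor : 3 mm here, purely illustrative; see Section 7.2 for the
                  hyperparameter that actually controls F1.
    """
    # no axis may shrink below the floor, so splats straddle sub-floor geometry
    return np.maximum(np.asarray(scales, dtype=float), scale_floor)
```

Any object feature thinner than the floor (a wire, a rim) is therefore represented by splats wider than the feature itself, which is exactly the bias F1 describes.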
F2 · ICP gate rejects large motion
tracking · When inter-frame translation exceeds 50 mm, the ICP prior is rejected and the tracker falls back to the temporal prior alone. On rapid hand re-grasps this leaves the optimizer with too small a basin, and the pose lags by 1–2 frames.
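The translation gate reads as a simple predicate; treating 50 mm as the only gate is a simplification (the thesis lists four ICP gates), and the names here are illustrative:

```python
import numpy as np

def icp_gate(t_prev, t_icp, max_translation=0.05):
    """Accept the ICP pose prior only if the implied inter-frame motion is small.

    t_prev, t_icp   : (3,) translations in meters
    max_translation : 50 mm gate from the failure analysis (one of several
                      gates in the full system)
    """
    return bool(np.linalg.norm(np.asarray(t_icp) - np.asarray(t_prev))
                <= max_translation)
```

When the gate returns False, the objective in Section 2.2 loses its λicp terms and only the temporal prior constrains the pose, which is why rapid re-grasps produce the 1–2 frame lag.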
F3 · Mask drift around DIGIT glow
segmentation · MobileSAM occasionally swaps the object mask for the bright DIGIT sensor glow. Median-area selection and the 60 px negative-prompt filter mitigate but don't fully solve it; outliers from drift events propagate into spawn decisions.
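A sketch of the median-area selection rule; modeling the 60 px filter as a plain area threshold is a simplification of the actual negative-prompt mechanism, and the function name is hypothetical:

```python
import numpy as np

def select_object_mask(masks, min_area_px=60):
    """Pick the median-area candidate mask after dropping tiny blobs.

    masks       : list of boolean/0-1 numpy arrays from the segmenter
    min_area_px : small-blob cutoff (stands in for the 60 px filter)
    Returns the chosen mask, or None if every candidate is too small.
    """
    areas = np.array([m.sum() for m in masks])
    ok = np.where(areas >= min_area_px)[0]
    if len(ok) == 0:
        return None
    order = ok[np.argsort(areas[ok])]      # surviving candidates by area
    return masks[order[len(order) // 2]]   # median-area survivor
```

Median selection is robust to one oversized or undersized outlier per frame, but as F3 notes it cannot recover when the glow blob and the object mask trade places entirely.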
Section 7.2 (Limitations) lists each of these together with the implementation hyperparameter that controls it — spawn-scale floor for F1, the four ICP gates for F2, and the MobileSAM area-selection rule for F3 — and Section 7.3 (Future Work) maps each to a concrete next step.
Where the thesis is right now
Manuscript locked, experiments running, defense scheduled.
The thesis is in its final experimental sweep. The companion site updates as results land. Downloadable PDF will appear here after the defense.
Timeline
Proposal accepted
Visuo-tactile reconstruction with explicit Gaussian map — problem formulation locked.
SLAM runtime feature-complete
Theseus pose optimizer, frozen-map SDF, occlusion-aware reweighting, contact-aware spawn.
Manipulation branch landed
Hunyuan3D-2-mini frame-0 prior, orientation-variant search, progressive replacement.
Final experiment sweep
40-trial Optuna baseline running on FeelSight-{Sim, Real, Occlusion}; benchmarks below populate as runs finish.
Defense
Committee review and oral defense at SNU Mechanical Engineering.
Final PDF + deposit
Bound deposit, repository freeze, downloadable thesis link goes live on this page.