Real-time multimodal 3D reconstruction with tactile-enhanced Gaussian splatting.
GaussianFeels is an online visuo-tactile reconstruction and tracking system built around an explicit object-centric 3D Gaussian map — updated under hand-induced occlusion, tracked when pose supervision is removed, and exported to manipulation from frame zero.
Why one camera isn't enough
The geometry that matters during in-hand manipulation is the geometry the camera can't see.
During grasping and reorientation, the surfaces a robot needs to reason about are exactly the ones occluded by the hand and fingertips. Tactile sensing measures those regions — but only over a small contact patch. A practical manipulation system needs both.
RGB-D camera · Dense observations of exposed surfaces. Fails behind the hand and the fingertips during contact.
Tactile sensing · Direct contact geometry at the manipulation interface. Tiny footprint — only a few square centimeters.
Proprioception · Forward-kinematic grasp center, finger poses, contact constraints. Used to seed and bound estimates.
An explicit object-centric Gaussian map can serve as the shared state for contact-rich manipulation: updated online from synchronized RGB-D, tactile, and proprioceptive observations — tracked directly when object-pose supervision is removed — and exposed immediately to downstream manipulation modules as a progressively improving object model.
Three problems, one representation
The thesis argues that prior systems pick a representation that solves one of these roles and forces a re-encoding for the other two. Implicit fields render and supervise well but don't export cleanly to a policy. Mesh-only pipelines export but can't be updated under contact. Point clouds update easily but render poorly. An explicit Gaussian state covers all three roles — and lets pose tracking sit in the middle of the loop.
Mapping · Sparse Gaussian updates with contact-aware population management and an active-budget cap so the runtime stays online.
Tracking · The frozen-map signed-distance field exposes a differentiable residual the Theseus optimizer can converge on under heavy occlusion.
Export · PLY exports and provenance-labeled point clouds hand off to a policy — measured surfaces flagged separately from generated ones.
Object-centric Gaussian map · pose tracker · occlusion-aware loss
Four parts, all reconstructing the same object frame.
Sensor ingestion synchronizes RGB-D, segmentation, tactile contacts, and hand state. The Gaussian map lives in object coordinates and is synchronized to world only when rendering or supervision needs it. The pose tracker sits in the middle of the loop — it's the transform that links observations to the map.
2.1 · Frozen-map signed distance
Tracking samples object pixels from the current depth image, transforms those points into the world frame, and minimizes a signed-distance residual against a frozen anchor cloud (qᵢ, nᵢ) sampled from recent keyframes. Weights decay with distance in Gaussian fashion; the result is a smooth analogue of a point-cloud SDF.
q̃(p) = Σᵢ wᵢ qᵢ / Σᵢ wᵢ   (4.5)
ñ(p) = Σᵢ wᵢ nᵢ / ‖Σᵢ wᵢ nᵢ‖   (4.6)
dM(p) = (p − q̃(p))ᵀ ñ(p)   (4.7)
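The weighted-anchor construction in (4.5)–(4.7) can be sketched in plain NumPy. The function name and the Gaussian falloff scale `sigma` are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def frozen_map_sdf(p, anchors_q, anchors_n, sigma=0.01):
    """Smooth point-to-surface signed distance against a frozen anchor cloud.

    p         : (3,) query point in world frame
    anchors_q : (N, 3) anchor positions q_i sampled from recent keyframes
    anchors_n : (N, 3) unit anchor normals n_i
    sigma     : Gaussian falloff scale in meters (illustrative value)
    """
    d2 = np.sum((anchors_q - p) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian-like weights
    q_tilde = (w[:, None] * anchors_q).sum(0) / w.sum()  # weighted anchor (4.5)
    n_tilde = (w[:, None] * anchors_n).sum(0)
    n_tilde /= np.linalg.norm(n_tilde)                   # renormalized normal (4.6)
    return float((p - q_tilde) @ n_tilde)                # signed distance (4.7)
```

For a planar anchor patch the function returns the signed height above the plane, which matches the plane's analytic SDF; the smooth weights are what make the residual differentiable for the tracker.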
2.2 · Pose objective
For frames after the seed, the runtime samples object pixels, lifts them, and optimizes only the latest pose in a sliding window. The objective combines a frozen-map residual on camera and tactile points with temporal and ICP priors:
E(Rt, tt) = Σₚ dM(Rt p + tt)²   (sum over sampled camera and tactile points)
  + λtr ‖tt − tt−1‖² + λrot ‖log(Rt−1ᵀ Rt)‖²
  + λicp,t ‖tt − t̂icp‖² + λicp,r ‖log(R̂icpᵀ Rt)‖²   (4.8)
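As a rough NumPy stand-in for the Theseus objective (4.8): `so3_log`, `pose_cost`, and the default weights are illustrative names and values, not the thesis code, and `sdf` is any callable matching the frozen-map residual of Section 2.1.

```python
import numpy as np

def so3_log(R):
    """Rotation-matrix log map, returned as an axis-angle 3-vector."""
    cos_t = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-8:
        return np.zeros(3)
    return (theta / (2.0 * np.sin(theta))) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def pose_cost(R_t, t_t, pts_obj, sdf, R_prev, t_prev, R_icp, t_icp,
              lam_tr=1.0, lam_rot=1.0, lam_icp_t=1.0, lam_icp_r=1.0):
    """Evaluate (4.8): frozen-map data term plus temporal and ICP priors."""
    data = sum(sdf(R_t @ p + t_t) ** 2 for p in pts_obj)
    temporal = (lam_tr * np.sum((t_t - t_prev) ** 2)
                + lam_rot * np.sum(so3_log(R_prev.T @ R_t) ** 2))
    icp = (lam_icp_t * np.sum((t_t - t_icp) ** 2)
           + lam_icp_r * np.sum(so3_log(R_icp.T @ R_t) ** 2))
    return data + temporal + icp
```

Theseus would minimize this over (Rt, tt) with analytic Jacobians; the sketch only evaluates the scalar cost so each term can be checked against the equation.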
2.3 · Occlusion-aware reweighting
RGB and depth residuals become misleading when the hand fills a large fraction of the object view. The trainer monitors an occlusion ratio ρocc from a dilated foreground-and-edge mask and reweights the four core supervision channels online:
λ′rgb = svis λrgb ,  λ′depth = svis λdepth   (4.15, 4.16)
λ′tactile = (1 + β ρocc) λtactile   (4.17)
Here svis is the visibility weight, floored at wmin as occlusion grows.
With wmin = 0.2 and β = 2, tactile supervision overtakes the unoccluded visual baseline at ρocc ≈ 0.4 — the point at which the hand starts dominating the camera view.
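A minimal sketch of the reweighting rule, assuming svis = max(wmin, 1 − ρocc); that exact visibility score is inferred from the surrounding text rather than quoted from the thesis:

```python
def reweight(lam_rgb, lam_depth, lam_tactile, rho_occ, w_min=0.2, beta=2.0):
    """Occlusion-aware reweighting of the supervision channels.

    Assumes s_vis = max(w_min, 1 - rho_occ); the thesis may define the
    visibility score differently.
    """
    s_vis = max(w_min, 1.0 - rho_occ)
    return {
        "rgb": s_vis * lam_rgb,                            # (4.15)
        "depth": s_vis * lam_depth,                        # (4.16)
        "tactile": (1.0 + beta * rho_occ) * lam_tactile,   # (4.17)
    }
```

With these defaults, visual weights bottom out at 0.2× once the hand dominates the view, while the tactile weight grows linearly in ρocc.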
2.4 · Manipulation-side completion
A second process bootstraps a frame-0 estimate from a single RGBA crop using Hunyuan3D-2-mini, an orientation-variant search, and a registration stage that solves the model-to-object transform. As the SLAM loop accumulates measurements, generated geometry is progressively replaced by measured geometry — with provenance preserved so the policy knows which surfaces are real.
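The progressive-replacement step might look like the following sketch, where the replacement radius and the provenance encoding (0 = generated, 1 = measured) are assumptions for illustration, not the pipeline's actual rule:

```python
import numpy as np

def merge_measured(cloud_xyz, provenance, measured_xyz, radius=0.005):
    """Replace generated points near new measurements; keep provenance labels.

    cloud_xyz    : (N, 3) current object cloud
    provenance   : (N,) int labels, 0 = generated, 1 = measured (assumed scheme)
    measured_xyz : (M, 3) new measured surface points
    radius       : replacement radius in meters (illustrative value)
    """
    keep = np.ones(len(cloud_xyz), dtype=bool)
    for i, (p, lab) in enumerate(zip(cloud_xyz, provenance)):
        # drop a generated point once a real measurement lands nearby
        if lab == 0 and np.min(np.linalg.norm(measured_xyz - p, axis=1)) < radius:
            keep[i] = False
    new_xyz = np.vstack([cloud_xyz[keep], measured_xyz])
    new_prov = np.concatenate([provenance[keep],
                               np.ones(len(measured_xyz), dtype=int)])
    return new_xyz, new_prov
```

The key invariant is that labels travel with the points, so a downstream policy can always ask which fraction of the surface is measured rather than hallucinated.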
Per-frame runtime
Two processes, one shared object frame.
SLAM runs on one GPU, manipulation-side completion on another. The frame-0 payload bootstraps a Hunyuan3D-2-mini prior; later frames progressively replace generated geometry with measured geometry as the episode unfolds.
Figure. Two-process runtime. Solid arrows are within-process; dashed arrows are inter-process queue messages. The pose mode strip across the bottom describes which channel feeds TtWO — the rest of the SLAM loop is identical across modes.
Live reconstructions · animated
What the system actually does, frame by frame.
Each tile below is an interactive simulation of a different stage of the runtime — drag the timeline to scrub. Real video captures from FeelSight-Sim, FeelSight-Real, and FeelSight-Occlusion will replace these tiles after the final experiment sweep (see status section).
Note. These visualizations recreate the qualitative behavior described in Chapters 4–5 of the thesis. Final videos — rendered from saved PLY exports of the 40-trial benchmark — will be linked here once the run completes.
Reconstruction quality · pose tracking · shape priors
How the system holds up under the saved 40-trial sweep.
Three benchmark groups: (a) reconstruction quality on FeelSight-{Sim, Real, Occlusion}, (b) pose-tracking stability under three pose modes, and (c) frame-0 image-to-3D priors evaluated by aligned F-score at 5 mm.
5.1 · Reconstruction quality (FeelSight family)
F@5 mm ↑ · ADD-S (mm) ↓ · runtime FPS ↑ · fraction of trials reaching ≥ 50% measured surface ↑
| Variant | Pose mode | F@5 mm ↑ | ADD-S (mm) ↓ | FPS ↑ | Measured ≥50% ↑ |
|---|---|---|---|---|---|
| FeelSight-Sim | map | — pending — | — | — | — |
| FeelSight-Sim | true_slam | — | — | — | — |
| FeelSight-Real | slam | — | — | — | — |
| FeelSight-Real | true_slam | — | — | — | — |
| FeelSight-Occlusion | slam | — | — | — | — |
| FeelSight-Occlusion | true_slam | — | — | — | — |
Table 6.1. Reserved table for the final SLAM sweep. Numbers will be filled in from the 40-trial Optuna baseline (see Appendix A.1 of the thesis).
5.2 · Frame-0 shape-prior comparison · 40 trials · F@5 mm
Below is the partial benchmark already tabulated in the thesis manuscript. Hunyuan3D-2-mini is the deployed prior for the manipulation branch; the others are baselines.
| Method | F@5 mm (mean) | F@5 mm (median) | Wins on cuboid | Wins on articulated | Used in pipeline |
|---|---|---|---|---|---|
| Hunyuan3D-2-mini (deployed) | 0.689 | 0.720 | 24/40 | 31/40 | ✓ frame-0 prior |
| InstantMesh | 0.639 | 0.651 | 29/40 | 8/40 | baseline |
| Fast-SAM3D | 0.633 | 0.648 | — | — | baseline |
| TripoSR | 0.576 | 0.590 | — (high F, blob) | — | baseline |
| RGB2Point | 0.484 | 0.495 | — | — | baseline |
| Geco | 0.400 | 0.411 | — | — (collapses) | baseline |
Table 6.3. The benchmark cannot be interpreted from F-score alone — TripoSR can outscore Hunyuan3D-2-mini on a structured cuboid trial while still rounding the object into a generic blob. Qualitative win-rate on category-faithful shapes is reported alongside F-score.
5.3 · Effect of frame-0 prior on downstream manipulation
The manipulation branch hands a provenance-labeled object cloud to a downstream policy from t = 0. The benchmark evaluates whether the early prior helps or hurts policy success when measured geometry is still sparse.
Planned metric cards:
- frame-0 prior on vs. off, FeelSight-Real
- policy success once measured fraction ≥ 80%
- time to first stable grasp
- agreement between generated and measured surfaces
Table 6.5 (reserved). Cards are placeholders — they will be replaced by the final manipulation-prior ablation once policy rollouts are available.
Where the system breaks — and why
Honest failures from pilot runs.
Failure cases matter as much as benchmarks. The thesis catalogues three recurring modes; reproducing the qualitative behavior here lets future work target the right thing.
F1 · Thin / wire-like objects
map population · Anisotropic Gaussians can't shrink below the contact-spawn scale floor; wire-like geometry gets covered by splats biased outside the GT line. Tactile contacts help locally but can't carry the rest of the object.
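The scale floor behind this failure can be illustrated with a one-line clamp; the 3 mm value is a made-up stand-in for the actual hyperparameter:

```python
import numpy as np

def clamp_spawn_scales(scales, scale_floor=0.003):
    """Per-axis scale floor applied at contact spawn.

    scales      : (..., 3) Gaussian scales in meters
    scale_floor : 3 mm here, purely illustrative; see Section 7.2 for the
                  hyperparameter that actually controls F1.
    """
    # no axis may shrink below the floor, so splats straddle sub-floor geometry
    return np.maximum(np.asarray(scales, dtype=float), scale_floor)
```

Any object feature thinner than the floor (a wire, a rim) is therefore represented by splats wider than the feature itself, which is exactly the bias F1 describes.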
F2 · ICP gate rejects large motion
tracking · When inter-frame translation exceeds 50 mm, the ICP prior is rejected and the tracker falls back to the temporal prior alone. On rapid hand re-grasps this leaves the optimizer with too small a basin, and the pose lags by 1–2 frames.
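The translation gate reads as a simple predicate; treating 50 mm as the only gate is a simplification (the thesis lists four ICP gates), and the names here are illustrative:

```python
import numpy as np

def icp_gate(t_prev, t_icp, max_translation=0.05):
    """Accept the ICP pose prior only if the implied inter-frame motion is small.

    t_prev, t_icp   : (3,) translations in meters
    max_translation : 50 mm gate from the failure analysis (one of several
                      gates in the full system)
    """
    return bool(np.linalg.norm(np.asarray(t_icp) - np.asarray(t_prev))
                <= max_translation)
```

When the gate returns False, the objective in Section 2.2 loses its λicp terms and only the temporal prior constrains the pose, which is why rapid re-grasps produce the 1–2 frame lag.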
F3 · Mask drift around DIGIT glow
segmentation · MobileSAM occasionally swaps the object mask for the bright DIGIT sensor glow. Median-area selection and the 60 px negative-prompt filter mitigate but don't fully solve it; outliers from drift events propagate into spawn decisions.
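A sketch of the median-area selection rule; modeling the 60 px filter as a plain area threshold is a simplification of the actual negative-prompt mechanism, and the function name is hypothetical:

```python
import numpy as np

def select_object_mask(masks, min_area_px=60):
    """Pick the median-area candidate mask after dropping tiny blobs.

    masks       : list of boolean/0-1 numpy arrays from the segmenter
    min_area_px : small-blob cutoff (stands in for the 60 px filter)
    Returns the chosen mask, or None if every candidate is too small.
    """
    areas = np.array([m.sum() for m in masks])
    ok = np.where(areas >= min_area_px)[0]
    if len(ok) == 0:
        return None
    order = ok[np.argsort(areas[ok])]      # surviving candidates by area
    return masks[order[len(order) // 2]]   # median-area survivor
```

Median selection is robust to one oversized or undersized outlier per frame, but as F3 notes it cannot recover when the glow blob and the object mask trade places entirely.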
Section 7.2 (Limitations) lists each of these together with the implementation hyperparameter that controls it — spawn-scale floor for F1, the four ICP gates for F2, and the MobileSAM area-selection rule for F3 — and Section 7.3 (Future Work) maps each to a concrete next step.
Where the thesis is right now
Manuscript locked, experiments running, defense scheduled.
The thesis is in its final experimental sweep. The companion site updates as results land. Downloadable PDF will appear here after the defense.
Timeline
Proposal accepted
Visuo-tactile reconstruction with explicit Gaussian map — problem formulation locked.
SLAM runtime feature-complete
Theseus pose optimizer, frozen-map SDF, occlusion-aware reweighting, contact-aware spawn.
Manipulation branch landed
Hunyuan3D-2-mini frame-0 prior, orientation-variant search, progressive replacement.
Final experiment sweep
40-trial Optuna baseline running on FeelSight-{Sim, Real, Occlusion}; benchmarks below populate as runs finish.
Defense
Committee review and oral defense at SNU Mechanical Engineering.
Final PDF + deposit
Bound deposit, repository freeze, downloadable thesis link goes live on this page.