Improving Spatial Understanding Through Learning and Optimization

Time: Fri 2025-12-05 13.00

Location: F3 (Flodis), Lindstedtsvägen 26 & 28, Campus

Video link: https://kth-se.zoom.us/s/65134312330

Language: English

Doctoral student: Leonard Bruns, Robotics, Perception and Learning, RPL

Opponent: Professor Stefan Leutenegger, ETH Zürich, Zürich, Switzerland

Supervisor: Professor Patric Jensfelt, Robotics, Perception and Learning, RPL

Abstract

Spatial understanding comprises various abilities, from estimating the poses of objects and cameras within a scene to completing shapes from partial observations. These abilities are what enable humans to intuitively navigate and interact with the world. Despite significant progress in large-scale learning, computers still lack the intuitive spatial understanding of humans. In robotics, this gap limits the applicability of classical pipelines in real-world environments; in augmented reality, it limits both the achievable fidelity and the interaction of virtual content with real-world objects.

This thesis investigates ways to improve the spatial understanding of computers using different learning- and optimization-based techniques. Learning-based methods are employed to learn useful priors about objects and the 3D world, whereas optimization-based techniques are used to find models of objects and scenes that align well with a set of observations. Within this framework, we investigate and propose methods for three subproblems of spatial understanding.
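
To make this framework concrete before turning to the three subproblems, the following is a minimal sketch (in PyTorch, our assumption about tooling; residual and learned_prior are hypothetical placeholders, not the thesis' actual code) of the general pattern: an optimization fits model parameters to observations while a learned prior keeps the solution plausible.

import torch

def fit(params_init, observations, residual, learned_prior,
        weight=0.1, steps=200, lr=1e-2):
    # Find model parameters that both explain the observations and remain
    # plausible under a prior learned from data.
    params = params_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        data_term = residual(params, observations).pow(2).mean()  # fit the data
        prior_term = learned_prior(params)  # e.g., a negative log-likelihood
        loss = data_term + weight * prior_term
        loss.backward()
        optimizer.step()
    return params.detach()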

First, we propose a modular framework for categorical object pose and shape estimation, which combines a pre-trained generative shape model with a discriminative initialization network that regresses an initial pose and latent shape description from a partial point cloud of an object. By combining the generative shape model with a differentiable renderer, we further perform iterative, joint pose and shape optimization from one or multiple views. Our approach outperforms existing methods especially on unconstrained orientations, while achieving competitive results for upright, tabletop objects.
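
The following is a minimal sketch of such a render-and-compare optimization loop, assuming a pre-trained latent shape decoder and a differentiable depth renderer; decode_shape, render_depth, and the pose parameterization are hypothetical stand-ins, not the thesis' actual implementation.

import torch

def refine_pose_and_shape(z_init, pose_init, observations, decode_shape,
                          render_depth, steps=100, lr=1e-2):
    # Jointly refine a latent shape code and an object pose against one or
    # more observed depth maps by gradient descent through the renderer.
    z = z_init.clone().requires_grad_(True)        # latent shape description
    pose = pose_init.clone().requires_grad_(True)  # object pose parameters
    optimizer = torch.optim.Adam([z, pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        shape = decode_shape(z)                    # generative shape model
        loss = torch.tensor(0.0)
        for depth, camera in observations:         # one or multiple views
            rendered = render_depth(shape, pose, camera)  # differentiable
            valid = depth > 0                      # compare valid pixels only
            loss = loss + ((rendered - depth)[valid] ** 2).mean()
        loss.backward()
        optimizer.step()
    return z.detach(), pose.detach()

Because both the decoder and the renderer are differentiable, gradients from the depth discrepancy flow back into the pose and the latent shape code simultaneously, which is what makes the joint refinement possible.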

Second, we investigate the use of neural fields for dense, volumetric mapping. Specifically, we propose to represent the scene by a set of spatially constrained, movable neural fields anchored to a pose graph. We formulate the optimization of this multi-field scene representation as independent optimization of each field, and demonstrate that this approach allows real-time loop-closure integration, avoids transition artifacts at field boundaries, and outperforms current neural-field-based SLAM systems on larger scenes in which significant drift can accumulate.
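
A minimal sketch of the multi-field idea follows, assuming a small MLP per field; the class names, the field architecture, and the averaging at query time are illustrative assumptions rather than the thesis' design. The key property shown is that a loop closure only moves the anchor poses, so already-optimized field weights remain valid.

import torch
import torch.nn as nn

class LocalField(nn.Module):
    # A spatially bounded neural field defined in its own local frame.
    def __init__(self, radius=1.0, hidden=64):
        super().__init__()
        self.radius = radius
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # e.g., signed distance or occupancy
        )

    def forward(self, x_local):
        return self.mlp(x_local)

class MultiFieldMap:
    # Each field is rigidly attached to a pose-graph node; a loop closure
    # only updates the anchor poses, never the trained field weights.
    def __init__(self):
        self.fields = []   # LocalField instances, optimized independently
        self.anchors = []  # 4x4 world-from-local transforms (pose graph)

    def add_field(self, anchor_pose, radius=1.0):
        self.fields.append(LocalField(radius))
        self.anchors.append(anchor_pose)

    def on_loop_closure(self, corrected_anchors):
        self.anchors = corrected_anchors  # real time: no retraining needed

    def query(self, x_world):
        # Evaluate every field whose local sphere contains the point and
        # average the predictions (a simple stand-in for boundary blending).
        values = []
        for field, T in zip(self.fields, self.anchors):
            T_inv = torch.linalg.inv(T)
            x_local = T_inv[:3, :3] @ x_world + T_inv[:3, 3]
            if torch.linalg.norm(x_local) < field.radius:
                values.append(field(x_local))
        return torch.stack(values).mean(dim=0) if values else None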

Third, we investigate large-scale pre-training for visual relocalization using scene coordinate regression. We split the scene-specific regressor into a scene-agnostic regressor and a scene-specific latent map code, and propose a pre-training scheme for the scene-agnostic regressor so that it better generalizes from mapping images to query images with different viewpoints, lighting, and objects. We demonstrate that our approach outperforms existing methods under such dynamic mapping-query splits.
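
A minimal sketch of this split follows, assuming per-pixel image features from some encoder; the architecture and all names are illustrative assumptions. Because the regressor weights are shared across scenes, they can be pre-trained at scale, and mapping a new scene then reduces to optimizing only the latent map code.

import torch
import torch.nn as nn

class ConditionedRegressor(nn.Module):
    # Scene-agnostic scene coordinate regressor conditioned on a
    # scene-specific latent map code.
    def __init__(self, feat_dim=512, code_dim=128, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # predicted 3D scene coordinate
        )

    def forward(self, features, map_code):
        # features: (N, feat_dim) per-pixel descriptors from an image encoder
        # map_code: (code_dim,) scene-specific latent, shared by all pixels
        code = map_code.expand(features.shape[0], -1)
        return self.head(torch.cat([features, code], dim=-1))

# Mapping a new scene reduces to optimizing only the latent code, while
# the pre-trained, scene-agnostic weights stay frozen.
regressor = ConditionedRegressor()
for p in regressor.parameters():
    p.requires_grad_(False)
map_code = torch.zeros(128, requires_grad=True)
optimizer = torch.optim.Adam([map_code], lr=1e-3)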

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-372393