Capturing the Shape and Pose of Horses in 3D
Time: Mon 2025-01-13, 14:00
Location: F3 (Flodis), Lindstedtsvägen 26 & 28, Stockholm
Video link: https://kth-se.zoom.us/j/66272186963
Language: English
Subject area: Computer Science
Doctoral student: Ci Li, Robotics, Perception and Learning (RPL)
Opponent: Associate Professor Tilo Burghardt, University of Bristol
Supervisor: Hedvig Kjellström, Robotics, Perception and Learning (RPL); Silvia Zuffi, IMATI-CNR; Elin Hernlund, Swedish University of Agricultural Sciences (SLU)
QC 20241129
Abstract
Animals play a significant role in the Earth's ecology and have lived alongside humans throughout history. Studying and understanding their movements and behaviors is important both for advancing scientific knowledge and for practical applications. In this thesis, we focus specifically on horses, which are key subjects in both computer vision and biological research due to their speed, strength, and distinctive locomotion.
Traditional systems for capturing horse motion often rely on sensors or markers attached to the horse's body. Such systems, however, are typically limited to constrained environments and are difficult to use in natural, unconstrained settings. Capturing horses with standard video cameras, observing them in their natural environments, offers a more practical alternative. However, recovering the 3D shape and pose of a horse from 2D images is a highly challenging problem due to the inherent ambiguity of inferring 3D structure from 2D observations alone.
To address these challenges, we propose model-based methods that capture the 3D shape and pose of horses from monocular images or videos. We start by presenting hSMAL, a horse-specific 3D parameterized model learned from 3D scan data and capable of expressing diverse horse shapes. We also demonstrate the practical utility of this model in lameness detection, a critical veterinary task for assessing the well-being of horses. Additionally, we present a comprehensive horse motion dataset, collected with dense motion capture markers from horses of varying shapes performing diverse movements. This motion capture data allows us to animate hSMAL with real horse movements, providing detailed information about how horses move and addressing the common issue of limited data in animal research.
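The idea of a parameterized shape model can be illustrated with a minimal sketch: a mesh is expressed as a mean template plus a weighted sum of shape directions learned (e.g. via PCA) from 3D scans. All names, dimensions, and the random placeholder data below are assumptions for illustration, not the actual hSMAL model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_vertices = 1000       # hypothetical mesh resolution
n_shape_params = 10     # hypothetical number of shape components

# Placeholder stand-ins for quantities learned from 3D scans:
template = rng.normal(size=(n_vertices, 3))                      # mean horse mesh
shape_basis = rng.normal(size=(n_shape_params, n_vertices, 3))   # PCA shape directions

def shape_from_betas(betas):
    """Return mesh vertices for shape coefficients `betas` (length n_shape_params)."""
    # Weighted sum of shape directions, added to the mean template.
    return template + np.tensordot(betas, shape_basis, axes=1)

verts = shape_from_betas(np.zeros(n_shape_params))
assert verts.shape == (n_vertices, 3)
# With all-zero coefficients we recover the mean template exactly.
assert np.allclose(verts, template)
```

In a full SMAL-style model, pose would additionally deform these vertices through a kinematic skeleton and blend skinning; the sketch covers only the shape space.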
Building on the proposed model and dataset, we develop data-driven regression methods that capture horses in 3D from monocular images and videos in an end-to-end manner. First, we integrate multimodal data, combining video clips with audio. Our findings show that incorporating audio makes the method more robust, especially under visual ambiguity and occlusion. Second, we integrate vision foundation models and disentanglement learning with an on-the-fly synthetic data generation pipeline. The pipeline creates paired data during network training, facilitating the learning of disentangled feature spaces. Together, these approaches improve the generalization and adaptability of the method, yielding better performance on images from various domains and on other four-legged animals. Through experiments on both our own collected datasets and public datasets, we demonstrate the effectiveness of the proposed methods in advancing horse-specific capture from monocular images and videos.
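The multimodal regression idea can be sketched as a simple late-fusion step: per-modality embeddings are concatenated and passed through a head that regresses the model parameters. The feature dimensions, parameter counts, and the linear head below are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical embeddings produced by separate video and audio encoders.
video_feat = rng.normal(size=(1, 512))
audio_feat = rng.normal(size=(1, 128))

# Late fusion: concatenate the two modalities along the feature axis.
fused = np.concatenate([video_feat, audio_feat], axis=1)   # shape (1, 640)

# A linear regression head standing in for the network's output layer;
# assume 10 shape coefficients + 36 pose parameters (illustrative split).
n_params = 46
W = rng.normal(size=(fused.shape[1], n_params)) * 0.01
b = np.zeros(n_params)

pred_params = fused @ W + b
assert pred_params.shape == (1, n_params)
```

In practice the head would be a trained neural network and the audio branch could be down-weighted or dropped when the signal is absent, but the fusion-then-regress structure is the same.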
This thesis contributes methodologies for capturing horses from standard video cameras, specifically focusing on the 3D shape and pose, opening new possibilities for animal motion capture and analysis.