
Receptive field based representations for video-analysis and how to build invariant deep neural networks

Time: Thu 2018-10-25 13.15 - 15.00

Location: Room 4423, Lindstedtsvägen 5, KTH

Participating: Ylva Jansson, CST/EECS/KTH


Abstract

A major source of difficulty in computer vision is the widely varying appearance of objects under identity preserving visual transformations such as translations, scalings, rotations, nonlinear perspective transformations and illumination transformations. Encoding prior information about visual transformations in learning algorithms or visual features can enable faster learning from fewer samples, increased understanding of the properties of the representation and better generalisation.

The idea underlying scale-space theory is to impose such structural constraints on the first stages of visual processing, which leads to a normative theory of spatial, spatio-temporal and spatio-chromatic receptive fields [3]. Spatial receptive fields based on the Gaussian scale-space concept have been demonstrated to be a powerful front-end for solving a large range of visual tasks.
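To make the Gaussian scale-space front-end concrete, the sketch below (an illustration only, assuming a grey-level image as a NumPy array; the separable-convolution implementation, kernel truncation and scale parameter are my own illustrative choices, not taken from the talk) computes a smoothed scale-space representation and first-order Gaussian-derivative receptive field responses:

```python
import numpy as np

def gaussian_kernel(t, truncate=4.0):
    """Sampled 1-D Gaussian kernel at scale t (variance), normalized to sum 1."""
    sigma = np.sqrt(t)
    radius = int(np.ceil(truncate * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * t))
    return g / g.sum()

def scale_space_smooth(image, t):
    """Scale-space representation L(.; t) = g(.; t) * f of a 2-D image,
    computed as two 1-D convolutions (the Gaussian is separable)."""
    g = gaussian_kernel(t)
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)

def gaussian_derivative_responses(image, t):
    """First-order Gaussian-derivative receptive field responses (Lx, Ly),
    approximated here by central differences on the smoothed image."""
    L = scale_space_smooth(image, t)
    Lx = np.gradient(L, axis=1)
    Ly = np.gradient(L, axis=0)
    return Lx, Ly
```

On a constant image the smoothed representation stays constant away from the borders and the derivative responses vanish, which is a quick sanity check of the implementation.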

For real-time processing of time-varying visual stimuli, a new family of time-causal and time-recursive spatio-temporal receptive fields was recently presented in [4].

My first project [1], [2] concerns a first evaluation of these time-causal spatio-temporal receptive fields as primitives for video analysis. We propose a new family of video descriptors based on regional statistics of spatio-temporal receptive field responses and evaluate this approach on the problem of dynamic texture recognition. We show improved performance compared to a large range of similar methods using different primitives, either handcrafted or learned from data, indicating that the structural assumptions underlying the receptive field family are indeed useful.
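The regional-statistics idea can be sketched in a few lines. This is a toy stand-in, not the descriptors of [1], [2]: the derivative responses here come from plain finite differences rather than the time-causal receptive fields of [4], and only per-channel means and standard deviations are pooled over the video volume:

```python
import numpy as np

def video_descriptor(video):
    """Toy regional-statistics descriptor for a video given as a 3-D array
    (time, y, x): pool the mean and standard deviation of each derivative
    response channel over the whole spatio-temporal volume."""
    # Finite-difference stand-ins for spatio-temporal receptive field responses
    Lt, Ly, Lx = np.gradient(video.astype(float))
    stats = []
    for channel in (Lx, Ly, Lt):
        stats.extend([channel.mean(), channel.std()])
    return np.array(stats)  # 6-dimensional descriptor
```

Such descriptors could then be compared, e.g., by nearest-neighbour matching; for a static video (identical frames) the temporal-derivative statistics are exactly zero, which distinguishes it from a dynamic texture.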

In the last few years, deep learning has emerged as the new state of the art for a large range of visual tasks. However, the development has been predominantly experimental, and the properties of the learned representations are not yet fully understood. Deep neural networks, for example, show a peculiar sensitivity to imperceptible perturbations (adversarial examples). For my second project I will present some initial results on the invariance of CNN features for a patch matching task. I will end by outlining the future research agenda concerning how the basic ideas of invariant and covariant representations can be integrated with learning deep representations from data.
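One simple way to quantify invariance for patch matching is to compare a feature vector of a patch with that of a transformed copy. The sketch below uses a toy histogram feature in place of CNN activations; the feature, the 2x nearest-neighbour scaling, and the cosine score are all illustrative assumptions, not the talk's actual setup:

```python
import numpy as np

def toy_feature(patch, bins=16):
    """Toy patch feature: L2-normalized intensity histogram (invariant to
    this kind of scaling by construction, unlike raw CNN activations)."""
    h, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    h = h.astype(float)
    return h / (np.linalg.norm(h) + 1e-12)

def invariance_score(feat_a, feat_b):
    """Cosine similarity between the features of a patch and its transformed
    version; 1.0 means the feature is perfectly invariant."""
    return float(np.dot(feat_a, feat_b))

def upscale2x(patch):
    """Nearest-neighbour 2x upscaling as a simple identity-preserving
    scaling transformation."""
    return np.kron(patch, np.ones((2, 2)))
```

Running the same protocol on CNN features at different layers would reveal how invariance to the transformation varies with depth, which is the kind of question the patch matching experiments address.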

[1] Jansson, Ylva and Lindeberg, Tony (2018) "Dynamic texture recognition using time-causal and time-recursive spatio-temporal receptive fields", Journal of Mathematical Imaging and Vision, https://doi.org/10.1007/s10851-018-0826-9

[2] Jansson, Ylva and Lindeberg, Tony (2017) "Dynamic texture recognition using time-causal spatio-temporal scale-space filters", Proc. SSVM2017: Sixth International Conference on Scale Space and Variational Methods in Computer Vision, Kolding, Denmark, June 4-8, 2017. Springer LNCS vol 10302, p. 16-28

[3] Lindeberg, Tony (2017) "Normative theory of visual receptive fields." arXiv preprint arXiv:1701.06333.

[4] Lindeberg, Tony (2016) "Time-causal and time-recursive spatio-temporal receptive fields", Journal of Mathematical Imaging and Vision 55(1): 50-88.