
Receptive field based representations for time-causal video-analysis

Time: Thu 2017-06-01 11.00 - 12.00

Location: Lindstedtsvägen 5, 4423

Participating: Ylva Jansson


By imposing structural constraints reflecting known properties of the world on the first stages of visual processing, a normative theory of spatial, spatio-temporal and spatio-chromatic receptive fields can be derived [1]. For real-time processing of time-varying visual stimuli, a new family of time-causal spatio-temporal receptive fields was presented in [2], which enables truly time-causal processing (i.e. filters that do not extend into the future), better response properties and an efficient time-recursive implementation. My research topic concerns the design and evaluation of video analysis methods based on time-causal spatio-temporal receptive fields, and I will present results from two projects:
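As a concrete illustration of the time-recursive aspect, the sketch below implements a cascade of first-order recursive temporal filters, a standard building block for time-causal temporal smoothing: each stage updates its output from the current input and its own previous output only, so no future frames are ever accessed and no buffer of past frames needs to be kept. The time constants are illustrative choices, not values from [2].

    import numpy as np

    def recursive_filter_cascade(signal, mus):
        """Minimal sketch: cascade of first-order recursive temporal
        filters. Each stage is an exponential smoother that depends only
        on the current input and the previous output, hence time-causal
        and time-recursive."""
        out = np.asarray(signal, dtype=float).copy()
        for mu in mus:
            state = out[0]
            for t in range(len(out)):
                # First-order update: state <- state + (x[t] - state) / (1 + mu)
                state += (out[t] - state) / (1.0 + mu)
                out[t] = state
        return out

    # Hypothetical, roughly logarithmically spaced time constants
    mus = [0.5, 1.0, 2.0, 4.0]
    smoothed = recursive_filter_cascade(np.random.randn(100), mus)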

(i) Dynamic texture recognition using time-causal spatio-temporal scale-space filters

Our first application of time-causal spatio-temporal receptive fields to video analysis concerns dynamic texture recognition (intuitively, dynamic texture = "texture + motion"). For this purpose, a new family of video descriptors based on regional statistics of time-causal spatio-temporal receptive field responses has been developed, generalising a previously used method based on joint histograms of receptive field responses from the spatial to the spatio-temporal domain. I will present this new descriptor family and a comparison to state-of-the-art dynamic texture recognition methods, as well as qualitative and quantitative effects of design choices such as utilising different receptive field groups, spatial and temporal scales, numbers of histogram bins, degrees of dimensionality reduction, etc. (early results are published in [3]).
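To indicate what such a descriptor computes, here is a minimal sketch of a joint histogram over spatio-temporal derivative responses. For brevity it uses ordinary (non-causal) Gaussian derivatives from SciPy in place of the time-causal receptive fields, with a single scale and a single group of first-order derivatives; the function name and parameters are illustrative assumptions, not the descriptors from [3].

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def joint_histogram_descriptor(video, s_space=2.0, s_time=2.0, n_bins=5):
        """Sketch of a joint-histogram video descriptor.
        video: array of shape (T, H, W)."""
        sigma = (s_time, s_space, s_space)
        # First-order derivative responses along t, y and x
        responses = [gaussian_filter(video, sigma, order=o)
                     for o in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]]
        # Joint (multi-dimensional) histogram of the co-occurring responses
        samples = np.stack([r.ravel() for r in responses], axis=1)
        hist, _ = np.histogramdd(samples, bins=n_bins)
        return hist.ravel() / hist.sum()  # normalised descriptor vector

    desc = joint_histogram_descriptor(np.random.randn(20, 64, 64))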

(ii) Invariance properties of CNN features

In contrast to axiomatically derived image representations, the transformation properties of convolutional neural network (CNN) features are not yet fully known. Since such "deep features" are increasingly used as general-purpose visual primitives, a better understanding of their properties is of both practical and theoretical interest. Convolutional neural networks are clearly not by construction invariant or covariant to general natural image transformations, but training on very large image databases together with data augmentation/jittering aims to make the network robust to at least some degree of transformation of the input image. But how invariant are such representations really? I will show empirical results from a pre-study on the robustness of the internal representations of CNNs when the input image is subject to affine image transformations, as well as how such transformations affect performance on a patch matching task.
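For intuition about how such robustness can be probed, the hypothetical sketch below compares an intermediate representation of a pretrained CNN for an image and an affinely warped copy of it, using cosine similarity. The network, layer index and transformation parameters are illustrative assumptions, not the actual protocol of the pre-study.

    import torch
    import torchvision.models as models
    import torchvision.transforms.functional as TF

    # Illustrative choice of network: the convolutional part of VGG-16
    model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

    def features(img, layer=16):
        """Run the image through the feature modules up to `layer`
        (layer index is a hypothetical choice)."""
        x = img.unsqueeze(0)
        for i, module in enumerate(model):
            x = module(x)
            if i == layer:
                break
        return x.flatten()

    img = torch.rand(3, 224, 224)  # stand-in for a real input image
    warped = TF.affine(img, angle=15.0, translate=[0, 0],
                       scale=1.2, shear=[0.0])

    with torch.no_grad():
        sim = torch.nn.functional.cosine_similarity(
            features(img), features(warped), dim=0)
    print(f"cosine similarity under affine warp: {sim.item():.3f}")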

I will conclude by outlining future directions of research concerning possible applications of time-causal spatio-temporal receptive fields in video analysis, including possibilities for combining this framework with learning from data.

References:

[1] Lindeberg, Tony (2017). "Normative theory of visual receptive fields." arXiv preprint arXiv:1701.06333.

[2] Lindeberg, Tony (2016). "Time-causal and time-recursive spatio-temporal receptive fields." Journal of Mathematical Imaging and Vision 55(1): 50-88.

[3] Jansson, Ylva and Lindeberg, Tony (2017). "Dynamic texture recognition using time-causal spatio-temporal scale-space filters." Proc. SSVM 2017: Sixth International Conference on Scale Space and Variational Methods in Computer Vision, Kolding, Denmark, June 4-8, 2017. Springer LNCS vol. 10302, in press.