We study and develop deep networks that handle scaling transformations and other image transformations in a theoretically well-founded manner, preferably in terms of provable covariance and invariance properties.
As a first contribution, we have proposed a way to define provably scale-covariant hand-crafted networks by coupling scale-space operations in cascade. Specifically, we have studied in more detail a biologically inspired sub-class of such networks, obtained by coupling models of complex cells, expressed in terms of quasi quadrature measures, in cascade. Experimentally, we have evaluated these networks on the task of texture classification under scaling transformations, for scaling factors up to 4:
- Lindeberg (2020) "Provably scale-covariant continuous hierarchical networks by coupling scale-normalized differential entities in cascade", Journal of Mathematical Imaging and Vision, 62: 120–148, doi:10.1007/s10851-019-00915-x.
- Lindeberg (2019) "Provably scale-covariant hierarchical networks by coupling quasi quadrature measures in cascade", Proc. SSVM 2019: Scale Space and Variational Methods in Computer Vision, Springer LNCS 11603: 328-340, preprint at arXiv:1903.00289.
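As a reminder of what scale covariance means here, the following equations sketch the property for a uniform rescaling of the image domain by a factor S. The notation is an assumption made for this illustration: L(x; s) denotes the Gaussian scale-space representation of the image f at scale s (the variance of the Gaussian kernel), and the 1-D quasi quadrature measure is written with two free parameters C and Γ. The exact normalization used in the papers above should be checked against the papers themselves.

```latex
% Rescaled image and matching scale levels
f'(x') = f(x), \qquad x' = S\,x, \qquad s' = S^2 s
\quad\Longrightarrow\quad
L'(x';\, s') = L(x;\, s)

% Scale-normalized derivatives are equal at matching points and scales
\partial_{\xi} = s^{1/2}\,\partial_{x}
\quad\Longrightarrow\quad
\partial_{\xi'}^{m} L'(x';\, s') = \partial_{\xi}^{m} L(x;\, s)

% 1-D quasi quadrature measure built from first- and second-order derivatives
\mathcal{Q}_{x,\mathrm{norm}} L
  = \sqrt{\frac{s\,L_x^2 + C\,s^2 L_{xx}^2}{s^{\Gamma}}},
\qquad
(\mathcal{Q}_{x',\mathrm{norm}} L')(x';\, s') = S^{-\Gamma}\,(\mathcal{Q}_{x,\mathrm{norm}} L)(x;\, s)
```

Since each layer transforms in this predictable way, the property carries over when such operations are coupled in cascade, provided that the scale levels of the layers are rescaled accordingly.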
We have also studied the ability of CNNs to handle scaling transformations. First, we have demonstrated that traditional CNNs, as well as networks that concatenate the outputs from different scale channels, suffer from severe problems when applied to test data at scales that are not spanned by the training data. Sliding window approaches over multiple scale channels also have substantial, although less severe, problems when required to generalize over scaling transformations between the training data and the test data, as well as when there are few training data. We have then proposed two foveated architectures for scale channel networks, in which the fine scale channels accumulate support from smaller regions of interest than the coarser scale channels, with either max pooling or average pooling in the last classification layers. For such networks, it is indeed possible to construct formal proofs of scale invariance properties. These foveated architectures handle scaling transformations between the training data and the test data well over the range of scale factors for which there are supporting scale channels (a few additional scale channels are also needed beyond the boundaries of the scale range to handle scale boundary effects). In our experiments, we handle scale factors up to 8:
- Jansson and Lindeberg (2021) "Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges", Proc. International Conference on Pattern Recognition (ICPR 2020), pages 1181-1188, extended version in arXiv:2004.01536.
- Jansson and Lindeberg (2021) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", arXiv preprint arXiv:2106.06418.
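The mechanism behind foveated scale channels with max pooling can be illustrated with a minimal 1-D sketch. This is a toy under stated assumptions, not the architecture of the papers above: the image is an idealised continuous function (so that no resampling error occurs), `shared_network` is a hypothetical single filter followed by a ReLU, and the names `scale_channels` and `blob` are made up for this example. Each channel samples the image on the same grid stretched by its scale factor, so the fine channels see smaller regions than the coarse ones.

```python
import numpy as np

def shared_network(patch, grid):
    """Toy stand-in for a CNN whose weights are shared by all scale channels."""
    template = (1.0 - grid**2) * np.exp(-grid**2 / 2.0)  # hypothetical fixed filter
    return max(float(patch @ template), 0.0)             # linear filter followed by ReLU

def scale_channels(image, scales, grid):
    """Channel k samples the image on the fixed grid stretched by scales[k]."""
    return np.array([shared_network(image(grid * s), grid) for s in scales])

def blob(sigma):
    """Idealised continuous test image: a 1-D Gaussian blob of width sigma."""
    return lambda x: np.exp(-x**2 / (2.0 * sigma**2))

grid = np.linspace(-4.0, 4.0, 65)     # same number of samples in every channel
scales = 2.0 ** np.arange(-3, 4)      # 7 channels, stretch factors from 1/8 to 8

small = scale_channels(blob(1.0), scales, grid)
large = scale_channels(blob(2.0), scales, grid)   # same pattern, twice as large

# Rescaling the input by 2 shifts the channel outputs by one channel.
print(np.allclose(large[1:], small[:-1]))
# Max pooling over the channels then gives the same value for both inputs.
print(np.isclose(small.max(), large.max()))
```

In this sketch the pooled value is preserved only because the strongest response falls in an interior channel for both inputs. If it fell in the outermost channel, the shifted copy would be lost, which is one way to see why additional scale channels are needed beyond the boundaries of the scale range.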
We have also performed an in-depth study of the ability of spatial transformer networks to support true invariance properties. First, we have shown that spatial transformers that apply purely spatial transformations to CNN feature maps do not support invariance. Only spatial transformer networks that transform the input allow for true invariance properties. Then, we have performed a systematic study of how these properties affect the classification performance. Specifically, we have investigated different architectures for spatial transformer networks that make use of more complex features for computing the image transformations that map the input data to a reference frame, and demonstrated that these new spatial transformer architectures lead to better experimental performance:
- Finnveden, Jansson and Lindeberg (2021) "Understanding when spatial transformer networks do not support invariance, and what to do about it", Proc. International Conference on Pattern Recognition (ICPR 2020), pages 3427-3434, extended version in arXiv:2004.11678.
- Jansson, Maydanskiy, Finnveden and Lindeberg (2020) "Inability of spatial transformations of CNN feature maps to support invariant recognition", arXiv preprint arXiv:2004.14716.
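The difference between transforming the input and transforming the feature maps can be seen in a small NumPy sketch. It rests on simplifying assumptions that are not taken from the papers above: the image transformation is a 90 degree rotation (which is exact on a pixel grid), the inverse transformation is assumed to be known rather than predicted by a localization network, and `features` is a hypothetical one-filter layer.

```python
import numpy as np

def features(image):
    """Toy one-filter layer: circular horizontal difference followed by ReLU."""
    return np.maximum(image - np.roll(image, 1, axis=1), 0.0)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
rotated = np.rot90(image)                 # transformed input: 90 degree rotation

reference = features(image)

# (a) Undo the transformation on the input, then extract features.
input_aligned = features(np.rot90(rotated, k=-1))

# (b) Extract features first, then undo the transformation on the feature map.
feature_aligned = np.rot90(features(rotated), k=-1)

print(np.array_equal(input_aligned, reference))    # does (a) reproduce the reference?
print(np.allclose(feature_aligned, reference))     # does (b) reproduce the reference?
```

Variant (a) reproduces the reference feature map, whereas variant (b) does not: the horizontal filter applied to the rotated image responds to what were vertical differences in the original image, and no purely spatial rearrangement of that feature map can turn it into the horizontal responses.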
To handle the notion of scale in deep networks, we have also developed a dual scale-channel approach, based on scale channels that are constructed by coupling parameterized linear combinations of Gaussian derivatives in cascade, complemented by non-linear ReLU stages in between, and a final stage of max pooling over the different scale channels. Given that the learned parameters in the linear combinations of Gaussian derivatives are shared between the scale channels, the raw scale channels are provably scale covariant. The final stage after max pooling over the scale channels is, in addition, provably scale invariant. Experimentally, we demonstrate that the approach allows for scale generalization, with good ability to classify image patterns at scales not present in the training data.
- Lindeberg (2021) "Scale-covariant and scale-invariant Gaussian derivative networks", Proc. SSVM 2021: Scale Space and Variational Methods in Computer Vision, Springer LNCS 12679: 3–14, extended version in arXiv:2011.14759.
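The scale covariance of a single layer of this kind can be checked numerically in 1-D. The sketch below is a simplified illustration under stated assumptions, not the implementation used in the paper: it uses sampled Gaussian derivative kernels up to order 2, a hypothetical fixed weight vector in place of learned weights, and an analytically defined test pattern so that the rescaled input can be sampled without interpolation.

```python
import numpy as np

def gaussian_derivative_layer(signal, sigma, weights):
    """ReLU of a weighted sum of scale-normalised Gaussian derivatives (orders 0 to 2)."""
    radius = int(np.ceil(6 * sigma))
    u = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    g_x = -u / sigma**2 * g
    g_xx = (u**2 / sigma**2 - 1.0) / sigma**2 * g
    c0, c1, c2 = weights
    # Scale normalisation: the derivative of order m is multiplied by sigma**m.
    kernel = c0 * g + c1 * sigma * g_x + c2 * sigma**2 * g_xx
    return np.maximum(np.convolve(signal, kernel, mode="same"), 0.0)

def pattern(x):
    """Hypothetical smooth 1-D test pattern made of two Gaussian blobs."""
    return np.exp(-(x - 3.0)**2 / 32.0) + 0.5 * np.exp(-(x + 10.0)**2 / 50.0)

x = np.arange(-128, 129, dtype=float)
weights = (0.5, 1.0, -0.7)    # stand-in for learned weights, shared across scales

fine = gaussian_derivative_layer(pattern(x), sigma=4.0, weights=weights)
coarse = gaussian_derivative_layer(pattern(x / 2.0), sigma=8.0, weights=weights)

# Compare the response to the pattern enlarged by a factor of 2, computed at twice
# the scale and read out at position 2*x, with the original response at position x.
max_deviation = np.max(np.abs(coarse[::2] - fine[64:193]))
print(max_deviation)
```

The two responses agree up to discretisation error because the same weights are used at both scales and the derivatives are scale-normalised. Pooling such matching responses over a sufficiently wide set of scale channels, for example by max pooling, is what turns this covariance into invariance.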