Scalable Optimization Methods for Machine Learning
Acceleration, Adaptivity and Structured Non-Convexity
Time: Thu 2021-08-26 16.00
Subject area: Electrical Engineering
Doctoral student: Vien Van Mai , Reglerteknik
Opponent: Professor John Duchi, Stanford university
Supervisor: Mikael Johansson, Reglerteknik
This thesis aims at developing efficient optimization algorithms for solving large-scale machine learning problems. To cope with the increasing scale and complexity of such models, we focus on first-order and stochastic methods in which updates are carried out using only (noisy) information about function values and (sub)gradients. The broad question we ask is: given such algorithmic oracles, how can we best design algorithms that both have strong theoretical guarantees and are practically useful. To this end, we touch upon a wide range of problems and investigate various methods (including newly proposed and successful existing ones) for solving them. We pay significant attention to developing and analyzing methods that are easy to run at scale, less sensitive to parameter selection, and adaptive to the underlying problem structure.
In somewhat more detail, the first part of the thesis focuses on two fundamental classes of optimization problems which underpin many modern machine learning applications. We begin with deterministic composite minimization problems and show how to adapt Anderson acceleration---a memory-efficient and line search-free method with strong adaptation ability---to proximal gradient methods. Motivated by sensitivity and instability issues of the classical stochastic gradient descent method (SGD), we then turn to stochastic problems and investigate a number of techniques for improving the practical performance of SGD and its relatives. We derive several novel convergence results that cover important settings for which previous theory did not apply. For example, we move beyond stochastic convex optimization and establish sample complexity results for a momentum extension of SGD on weakly convex functions, and we look beyond quadratic growth and demonstrate improved stability and convergence of gradient clipping of SGD on problems with rapidly growing gradients.
In the second part, we consider somewhat more structured problems and explore how recent advances in first-order and stochastic optimization can be exploited to expedite solution times even further. The common feature of the algorithms in this part is that they are constructed through a careful combination of (i) acceleration and variance reduction techniques to decrease the total number of iterations; (ii) inexact (noisy) computations that execute each iteration faster, and only to the precision actually required; and (iii) efficient warm-start schemes. We establish complexity guarantees of the proposed algorithms and quantify the improved run time in terms of the problem-dependent parameters.