Adaptive strategies in non-convex optimization

Date
2022
Authors
Zhuang, Zhenxun
Abstract
Modern applications in machine learning increasingly rely on non-convex formulations, as these often capture the problem structure better. One prominent example is deep neural networks, which have achieved innumerable successes in fields including computer vision and natural language processing. However, optimizing a non-convex problem presents much greater difficulties than optimizing a convex one. A widely popular optimizer for such scenarios is Stochastic Gradient Descent (SGD), but its performance depends crucially on the choice of its step sizes. Tuning step sizes is notoriously laborious, and the optimal choice can vary drastically across problems. To save the labor of tuning, adaptive algorithms come to the rescue: an algorithm is said to be adaptive to a certain parameter of the problem if it does not need a priori knowledge of that parameter yet performs competitively with algorithms that know it. This dissertation presents our work on adaptive algorithms in the following scenarios:
1. In the stochastic optimization setting, we only receive stochastic gradients, and the level of noise in evaluating them greatly affects the convergence rate. Without prior knowledge of the noise scale, tuning is typically required to achieve the optimal rate. Considering this, we designed and analyzed noise-adaptive algorithms that automatically ensure (near-)optimal rates under different noise scales without knowing the scale in advance.
2. In training deep neural networks, the gradient magnitudes in different coordinates can span a very wide range unless normalization techniques, such as BatchNorm, are employed. In such situations, algorithms that do not account for these gradient scales can behave very poorly. To mitigate this, we formally established the advantage of scale-free algorithms that adapt to the gradient scales and demonstrated their benefits in empirical experiments.
3. Traditional analyses in non-convex optimization typically rely on the smoothness assumption. Yet this condition does not capture the properties of some deep learning objective functions, including those involving Long Short-Term Memory (LSTM) networks and Transformers. Instead, they satisfy a much more relaxed condition that allows potentially unbounded smoothness. Under this condition, we show that a generalized SignSGD algorithm (which updates using only the sign of each coordinate of the stochastic gradient when running SGD, as sketched below) theoretically matches the best-known convergence rates obtained by SGD with gradient clipping, yet needs no explicit clipping at all, and empirically matches the performance of Adam while beating other methods. Moreover, it can also be made to adapt automatically to the unknown relaxed smoothness.
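For reference, the basic sign update mentioned in item 3 can be written in a few lines. The sketch below is a minimal illustration in plain NumPy, using a hypothetical helper name and a toy quadratic objective; it shows only the per-coordinate sign step described above, not the full generalized algorithm analyzed in the dissertation.

```python
import numpy as np

def sign_sgd_step(params, stochastic_grad, step_size=1e-2):
    """One basic SignSGD step: move each coordinate by the sign of its
    stochastic gradient, ignoring the gradient's magnitude.

    This is only a sketch of the sign update described in the abstract,
    not the dissertation's generalized, adaptivity-equipped variant.
    """
    return params - step_size * np.sign(stochastic_grad)

# Hypothetical usage on a toy objective f(x) = 0.5 * ||x||^2, whose true
# gradient is x; additive Gaussian noise stands in for stochastic gradients.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
for _ in range(1000):
    noisy_grad = x + 0.1 * rng.normal(size=x.shape)
    x = sign_sgd_step(x, noisy_grad)
print(x)  # coordinates end up oscillating within O(step_size) of the optimum at 0
```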
License
Attribution-NonCommercial-ShareAlike 4.0 International