Momentum-based variance reduction in non-convex SGD

Files
1905.10018.pdf (1.57 MB)
Published version
Date
2019-12-08
Authors
Cutkosky, Ashok
Orabona, Francesco
OA Version
Published version
Citation
Ashok Cutkosky, Francesco Orabona. 2019. "Momentum-Based Variance Reduction in Non-Convex SGD." Advances in Neural Information Processing Systems.
Abstract
Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses F, STORM finds a point x with E[‖∇F(x)‖] ≤ O(1/√T + σ^(1/3)/T^(1/3)) in T iterations with σ^2 variance in the gradients, matching the optimal rate and without requiring knowledge of σ.
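The momentum-based variance reduction described in the abstract maintains a gradient estimate of the form d_t = ∇f(x_t, ξ_t) + (1 − a_t)(d_{t−1} − ∇f(x_{t−1}, ξ_t)), where both gradients are evaluated on the same fresh sample, combined with an adaptive step size. Below is a minimal NumPy sketch of such an update loop under stated assumptions; the function name grad_fn, its signature, and the default constants k, w, c are illustrative placeholders, not the paper's exact pseudocode or recommended values.

```python
import numpy as np

def storm_sketch(grad_fn, x0, T, k=0.1, w=0.1, c=1.0):
    """Sketch of a STORM-style loop (assumed interface, not the authors' code).

    grad_fn(x, t) is assumed to return a stochastic gradient of F at x,
    computed on the sample drawn at step t.
    """
    x = np.asarray(x0, dtype=float)
    sum_sq = 0.0                  # running sum of squared stochastic-gradient norms
    g = grad_fn(x, 0)             # first stochastic gradient
    d = g                         # initial estimate d_1
    for t in range(1, T + 1):
        sum_sq += float(np.dot(g, g))
        eta = k / (w + sum_sq) ** (1.0 / 3.0)   # adaptive learning rate (no knowledge of sigma)
        x_prev = x
        x = x - eta * d                          # descent step
        a = min(1.0, c * eta ** 2)               # momentum weight
        g = grad_fn(x, t)                        # gradient at the new point, fresh sample
        g_prev = grad_fn(x_prev, t)              # gradient at the old point, SAME sample
        d = g + (1.0 - a) * (d - g_prev)         # variance-reduced momentum estimate
    return x
```

Note that, unlike standard momentum, the correction term (d − g_prev) reuses the current sample at the previous iterate, which is what drives the variance reduction without any mega-batches; the constants above would need tuning in practice.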