Title: Gradient Descent Dominates Ridge: A Statistical View on Implicit Regularization
NO REGISTRATION REQUIRED: This event is open to all currently affiliated Columbia University faculty, students, staff, and researchers. While pre-registration is not required, you will need an active CUID to enter the Morningside campus and attend the seminars.
Speaker: Jingfeng Wu, Postdoctoral Researcher, Simons Institute for the Theory of Computing, University of California, Berkeley
Abstract: A key puzzle in deep learning is how simple gradient methods find generalizable solutions without explicit regularization. This talk discusses the implicit regularization of gradient descent (GD) through the lens of statistical dominance. Using least squares as a clean proxy, we present two surprising findings.
First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of that of ridge, whereas ridge can be polynomially worse than GD even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse than SGD. Finally, GD dominates SGD for a significant subclass of problems, namely those with fast and continuously decaying covariance spectra, which includes all problems satisfying the standard capacity condition.
This is joint work with Peter Bartlett, Sham Kakade, Jason Lee, and Bin Yu.
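
For readers who want to experiment with the GD-versus-ridge comparison described in the abstract, the following is a minimal NumPy sketch, not the speakers' construction or experimental setup. It runs early-stopped GD from zero on a synthetic least-squares problem and compares its excess risk with ridge regression. Everything specific here is an assumption made for illustration: the function names, the problem sizes, the polynomially decaying spectrum, and in particular the pairing lambda = 1 / (step_size * num_steps), which is one common heuristic for "comparable regularization" and may differ from the definition used in the talk.

```python
import numpy as np


def gd_least_squares(X, y, step_size, num_steps):
    """Run gradient descent from zero on the average squared loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        grad = X.T @ (X @ w - y) / n
        w = w - step_size * grad
    return w


def ridge(X, y, lam):
    """Closed-form ridge solution for the average squared loss plus lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)


def excess_risk(w, w_star, spectrum):
    """Excess risk (w - w*)^T Sigma (w - w*) for a diagonal covariance Sigma."""
    diff = w - w_star
    return float(np.sum(spectrum * diff ** 2))


def make_problem(n=200, d=50, noise_std=0.5, seed=0):
    """Synthetic Gaussian design with eigenvalues 1/i^2 (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    spectrum = 1.0 / np.arange(1, d + 1) ** 2
    X = rng.standard_normal((n, d)) * np.sqrt(spectrum)
    w_star = np.ones(d)
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y, w_star, spectrum


def compare(step_size=0.5, stopping_times=(1, 10, 100, 1000)):
    """Return (num_steps, GD risk, ridge risk) with lam = 1 / (step_size * num_steps)."""
    X, y, w_star, spectrum = make_problem()
    rows = []
    for t in stopping_times:
        w_gd = gd_least_squares(X, y, step_size, t)
        w_ridge = ridge(X, y, 1.0 / (step_size * t))  # assumed heuristic pairing
        rows.append((t, excess_risk(w_gd, w_star, spectrum),
                     excess_risk(w_ridge, w_star, spectrum)))
    return rows


if __name__ == "__main__":
    for t, risk_gd, risk_ridge in compare():
        print(f"steps={t:5d}  GD risk={risk_gd:.4f}  ridge risk={risk_ridge:.4f}")
```

A single synthetic run like this can only suggest how the two estimators track each other along the regularization path; it does not demonstrate the constant-factor or polynomial-gap statements, which are theoretical results about worst-case problem classes.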