| March 01, 2019

Knowing What to Know in Stochastic Optimization

The National Science Foundation’s (NSF) TRIPODS—Transdisciplinary Research in Principles of Data Science—program is part of an effort towards “Harnessing the Data Revolution,” one of the NSF’s “10 Big Ideas” of the decade. The program aims to bring together theoretical computer scientists, mathematicians, and statisticians to develop mathematical foundations for this aptly-named revolution. Several of the 12 TRIPODS-supported teams focus their research on novel continuous optimization algorithms, which lie at the core of most foundational data science topics. In its simplest and most abstract form, we can state the optimization problem as

\[\min\limits_{x\in {\cal X}} f(x), \]

where \(f\) is assumed to be continuous but may or may not be smooth and/or convex.

Optimization is by no means a new field. It underwent significant development—both in theory and practice—during the 1980s and 1990s, and by the end of the 20th century it was seemingly well understood. However, the “Big Data Revolution” presents new optimization challenges for two main reasons: the massive amounts of data that must be utilized and the inherent inexactness of this data. Most traditional methods are unable to handle these new applications and must be redesigned. In particular, researchers have developed traditional methods under the assumption that we can compute quantities \(f(x), \nabla f(x),\) and possibly \(\nabla^2 f(x)\) exactly or sufficiently accurately. Consequently, we can compute steps of optimization algorithms, such as the gradient descent step \(x^{k+1} =x^k - \alpha_k \nabla f(x^k)\) or the Newton step \(x^{k+1} =x^k-\alpha_k [\nabla^2f(x^k)] ^{-1}\nabla f(x^k),\) either exactly or with sufficient (deterministic) accuracy. When handling large and inaccurate data, the exact function \(f(x)\) may be unknown or simply too expensive to compute via deterministic methods. Thus, the new optimization paradigm focuses on methods where at least some of this information is computed inexactly and randomly. For instance, a stochastic estimate \(g^k\) may replace the gradient \(\nabla f(x^k)\) in a gradient descent method. Likewise, employment of randomized linear algebra techniques can compute an approximate estimate of \(\nabla^2 f(x^k)\) or \([\nabla^2f(x^k)]^{-1}\) for the corresponding Newton step.

Foundational research in stochastic optimization is meant to define general conditions on the inexact random information that leads to convergent algorithms, thus enhancing our understanding of when and how one can apply these algorithms. For example, consider stochastic gradient descent (SGD)—the most popular optimization algorithm for machine learning models—which takes steps of the form \(x^{k+1} =x^k-\alpha_k g^k,\) where \(g^k\) is an unbiased random estimate of \(\nabla f(x^k).\) If \(g^k\) is readily available, SGD can be very efficient and computationally inexpensive, which is the case with many popular machine learning models like logistic regression and neural networks. On the other hand, some very natural functions in machine learning—such as the “zero-one loss,” which measures a predictor’s error rate—do not allow direct application of SGD because they lack useful unbiased estimates of \(\nabla f(x^k)\).¹ While optimizing the zero-one loss may be a learning algorithm’s true goal, a surrogate loss function for which useful gradient estimates exist is often optimized instead. Yet if we change the condition on the gradient estimates, we can develop convergent optimization algorithms for the zero-one loss and other similar loss functions. For example, we may consider the condition

\[\|{\mathbb E}\:[g^k]-\nabla f(x^k)\|\:\leq \Theta_k\]

with a suitable choice of \(\Theta_k,\) which relaxes the requirement on the unbiasedness of \(g^k\) and allows its computation by some gradient approximation scheme (e.g., finite differences).

SGD has several drawbacks: it might not be robust because of the effect of the variance in \(g^k,\) it is heavily dependent on the choice of the step-size sequence \(\alpha_k,\) and it does not account for the curvature of \(f(x).\) However, imposing stronger conditions on \(g^k\) can remedy these issues. In particular, we can assume that \(g^k\) is a sufficiently accurate estimate of \(\nabla f(x^k)\) with some probability, thus controlling \(g^{k}\)’s variance. More generally, we can consider the following set of conditions on the estimates of the function values \(f_k,\) the gradient \(g^k,\) and the Hessian \(H^k:\)

\[|f_k- f(x^k)|\:\leq \Theta^0_k\]

\[\|g^k-\nabla f(x^k)\|\:\leq \Theta^1_k \tag1 \]

\[\|H^k-\nabla^2 f(x^k)\|\:\leq \Theta^2_k,\]

which should hold with some adequately high probability \(p\) for a suitable choice of \(\Theta^i_k, i=0,1,2.\)

As a result—and given appropriate choices of \(p\) and sequences \(\Theta^i_k\)—we are able to construct stochastic algorithms with similar behavior to classical deterministic algorithms, such as line search and trust-region methods. Such methods can perform line search, utilize second-order information, and display favorable convergence properties without the prohibitive expense of computing deterministic information about \(f(x),\) as occurs with SGD. While SGD has a known convergence rate, defined as the expected accuracy achieved after a certain number of steps, the methods based on (1) have both better expected convergence rates and expected complexity bounds, defined as the bounds on the number of iterations until desired accuracy is reached. Moreover, these bounds hold with high probability [1-4]. One interesting and important observation from the convergence analysis that uses (1) is that the quantities \(\Theta^i_k\) are closely connected with the accuracy achieved by the algorithm at iteration \(k,\) which indicates a tendency to adaptively decrease as the algorithm converges.

Analysis also shows that \(\Theta^0_k\)’s rate of decrease is faster than that of \(\Theta^1_k,\) which is in turn faster than \(\Theta^2_k.\) Thus, getting accurate function value estimates is most important; gradient estimates can be somewhat less accurate and Hessian estimates are allowed the highest amount of error. These realizations will likely lead to new algorithms that can utilize novel randomization techniques. We should expect to see interesting developments in the area of stochastic optimization in the near future.

¹Since sample gradients of this function are zero almost everywhere.

References
[1] Blanchet, J., Cartis, C., Menickelly, M., & Scheinberg, K. (2018). Convergence rate analysis of a stochastic trust region method for nonconvex optimization. Preprint, arXiv:1609.07428.
[2] Paquette, C., & Scheinberg, K. (2018). A stochastic line search method with expected complexity analysis. Preprint, arXiv:1807.07994.
[3] Roosta-Khorasani, F., & Mahoney, M.V. (2018). Sub-sampled newton methods. Math. Program., 1-34.
[4] Tripuraneni, N., Stern, M., Jin, C., Regier, J., & Jordan, M.I. (2018). Stochastic cubic regularization for fast nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Adv. Neur. Info. Proc. Syst. 31 (pp. 2904-2913). Montreal, Canada.

Katya Scheinberg is the Harvey E. Wagner Endowed Chair Professor in the Department of Industrial and Systems Engineering at Lehigh University. Her research interests lie in the development of efficient and theoretically sound algorithms for continuous optimization and machine learning.