SIAM News Blog

Optimization Theory and Perspectives on the Field of Machine Learning

By Manuchehr Aminian

The work of the research group of Michael Jordan (University of California, Berkeley) was a highlight of the 2020 SIAM Conference on Mathematics of Data Science (MDS20), which took place virtually earlier this year. The team’s research focuses on connections between optimization, geometry, probability, dynamical systems, and machine learning (ML). During his plenary talk at MDS20, Jordan presented an impressive array of theoretical results in optimization that his group established, motivated by ML. He also offered a broad vision of ML’s past and present and described how he intends to orient his research toward what he sees as the field’s future.

Gradient Descent and Saddle Points

Optimization is at the heart of model building in ML. One can broadly categorize optimization problems as either convex or nonconvex, depending on the type of function to be minimized and the space over which it is optimized. Convex optimization problems are generally well understood and mathematically friendly; researchers can establish provable convergence results, and the algorithms that solve such problems behave well in practice.

Most optimization problems in ML and data science are decidedly nonconvex. Objective functions are often nonlinear, and decision variables are frequently binary and/or integer-valued — especially when one wants to make a decision or recommendation. Jordan reviewed a range of proven results that relate to high-dimensional gradient descent (GD), which iteratively improves an initial guess \(x_0\) and produces a sequence of approximations \(x_k\) that ideally converge to a global minimum of a function \(f(x)\) by following the gradient “downhill.”
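For readers who want a concrete anchor, the basic GD update is \(x_{k+1} = x_k - \eta \nabla f(x_k)\) for a step size \(\eta > 0\). The following minimal Python sketch (with an arbitrary fixed step size and a toy quadratic objective chosen purely for illustration) implements that update.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_steps=1000):
    """Plain gradient descent: x_{k+1} = x_k - step_size * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Toy convex example: f(x) = 0.5 * ||x||^2, whose gradient is simply x.
# The iterates converge to the global minimum at the origin.
x_min = gradient_descent(lambda x: x, x0=np.ones(10))
```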

Figure 1. Point \(u\) lies in the green region, where the gradient \(\nabla f\) (represented by the blue arrows) will trap it near the saddle point in the center for many iterations. The perturbed point \(w\) will quickly escape. Figure courtesy of Michael Jordan [1].
First, Jordan highlighted a well-established result from Yurii Nesterov’s 1998 lecture notes [2]: for unconstrained convex optimization, GD convergence rates do not depend on the dimensionality of the space. In other words, if we measure a notion of how close we are to the global minimum, convergence rates depend on the function’s properties and the number of descent steps; they are the same whether a problem has 10 variables or 10 million. Readers might find this dimension independence surprising; the “catch,” perhaps, is that in practical problems the function’s Lipschitz constant may itself grow with the dimension.
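As a point of reference (this is the textbook statement rather than a formula from the talk), for a convex function \(f\) whose gradient is Lipschitz continuous with constant \(L\), GD with step size \(1/L\) satisfies

\[
f(x_k) - f(x^*) \;\le\; \frac{L \, \|x_0 - x^*\|^2}{2k},
\]

where \(x^*\) is a global minimizer. The dimension never appears; only \(L\), the initial distance to the minimizer, and the iteration count \(k\) enter the bound.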

To extend these ideas to nonconvex optimization, Jordan emphasized the main results of his 2017 study [1], which proves similar convergence rates with a few key differences. Nonconvex functions can have local minima and saddle points that trap and slow descent methods. To address this, he and his collaborators use a “perturbed” GD method (not to be confused with stochastic GD) that occasionally adds a random jitter to the current iterate, which might otherwise remain trapped near a saddle point or local minimum. Expecting convergence to a global minimum in this general setting is too optimistic. But softening the statement to convergence near “strongly convex” minima produced a result that is only mildly worse than the convex case, with an extra factor of \(\log^4(d)\) (where \(d\) is the dimension) that conceptually reflects the slowdown incurred while escaping saddle points [1]. The group’s proof technique is probabilistic and relies on the geometry around saddle points: aside from a narrow, essentially low-dimensional band around the saddle point, points that start in the highlighted region will escape after being perturbed away (see Figure 1).
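A minimal sketch of the perturbation idea (a simplification, not the exact algorithm or parameter schedule from [1]) is to run ordinary GD and, whenever the gradient is small enough that the iterate could be sitting near a saddle point, add noise drawn uniformly from a small ball before continuing.

```python
import numpy as np

def perturbed_gradient_descent(grad_f, x0, step_size=0.1, num_steps=1000,
                               grad_tol=1e-3, radius=1e-2, rng=None):
    """Gradient descent with occasional random jitter near flat regions.

    Simplified sketch of the perturbation idea in [1]: when the gradient is
    small (the iterate may be near a saddle point), jitter the iterate with
    noise sampled uniformly from a small ball so it can fall off the saddle.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad_f(x)
        if np.linalg.norm(g) < grad_tol:
            direction = rng.standard_normal(x.shape)
            direction /= np.linalg.norm(direction)
            x = x + radius * rng.uniform() ** (1.0 / x.size) * direction
            g = grad_f(x)
        x = x - step_size * g
    return x

# Toy nonconvex example with a saddle at the origin:
# f(x, y) = x**2 - y**2 + y**4 / 4 has minima at (0, +/-sqrt(2)).
grad = lambda z: np.array([2 * z[0], -2 * z[1] + z[1] ** 3])
x_min = perturbed_gradient_descent(grad, x0=np.zeros(2))  # escapes the saddle
```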

All else equal, the theory thus states that a problem in 10 million dimensions will converge more slowly than a problem in 10 dimensions by a factor of \(7^4\), or 2,401. Jordan believes that the exponent of 4 is an artifact of the proof and that the factor ought to be a plain \(\log(d)\). However, he has been unable to improve upon it. Perhaps this is an opportunity for future collaboration.
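The arithmetic behind that factor, reading 10 million as \(10^7\) and comparing \(\log^4(d)\) at the two dimensions, is simply

\[
\left(\frac{\log(10^7)}{\log(10)}\right)^{4} = 7^4 = 2401.
\]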

Discrete- and Continuous-Time Dynamical Systems

Another aspect of Jordan’s research involves understanding the connections between GD and other optimization algorithms, as well as the underlying mathematics that connects the algorithms’ discrete-time formulations with their continuous-time limits. Jordan surveyed this area and compared several of the proven results. For instance, while GD has an asymptotic convergence rate of \(O(1/k)\) for unconstrained convex optimization (see Figure 2), Nesterov demonstrated that “accelerated” GD (a two-term method) has a faster convergence rate of \(O(1/k^2)\).
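As a concrete illustration, one standard form of the two-term accelerated method pairs the gradient step with a momentum (extrapolation) term; the coefficient \((k-1)/(k+2)\) below is a common textbook choice and is not necessarily the exact variant shown in the talk.

```python
import numpy as np

def accelerated_gradient_descent(grad_f, x0, step_size=0.1, num_steps=1000):
    """Nesterov-style accelerated GD with momentum coefficient (k-1)/(k+2)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, num_steps + 1):
        momentum = (k - 1) / (k + 2)
        y = x + momentum * (x - x_prev)           # extrapolated "look-ahead" point
        x_prev, x = x, y - step_size * grad_f(y)  # gradient step taken from y
    return x
```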

Figure 2. Comparisons between gradient descent (GD) and accelerated GD. The accelerated approach has provably faster convergence and limits to an interesting continuous-time flow. The differential equation for accelerated GD is courtesy of [3].

To unify these and other descent methods, researchers have been studying the corresponding continuous-time flows that are associated with a discrete descent method. For example, one can understand GD as the time discretization of so-called gradient flow when taking the stepsize \(\beta \to 0\); here, one keeps the corresponding “rate” of \(O(1/t)\). Jordan then observed that previous work has utilized the same concept to identify a second-order differential equation that corresponds to accelerated GD, which has a peculiar form [3]. The analogous rate of \(O(1/t^2)\) is again kept.
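In symbols, gradient flow is the first-order equation \(\dot{X}(t) = -\nabla f(X(t))\), while the second-order equation identified in [3] as the continuous-time limit of accelerated GD is

\[
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f(X(t)) = 0,
\]

whose vanishing damping coefficient \(3/t\) is part of what gives it its peculiar form.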

Given these observations, a motivating question for Jordan was whether the results could be extended. This was the topic of his 2016 paper with Andre Wibisono and Ashia Wilson [4]. The group approached this question from the perspective of variational calculus and defined the so-called Bregman Lagrangian, from which one can build a family of differential equations that includes gradient flow, accelerated gradient flow, and a continuum of other dynamics. Many interesting details appear in this work and in subsequent studies; for instance, while one can achieve exponential convergence rates in continuous time (compared with the algebraic rates of GD and accelerated GD), those exponential rates provably cannot be retained after discretization. To provide some intuition for this phenomenon, Jordan remarked that the \(O(1/k^2)\) bound on discrete convergence rates stems from the fact that GD methods inevitably slow down in regions of high curvature.
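For reference, the Bregman Lagrangian of [4] takes (up to notational conventions) the form

\[
\mathcal{L}(X, V, t) \;=\; e^{\alpha_t + \gamma_t}\Bigl( D_h\bigl(X + e^{-\alpha_t} V,\, X\bigr) - e^{\beta_t} f(X) \Bigr),
\]

where \(D_h\) is the Bregman divergence generated by a convex function \(h\) and \(\alpha_t\), \(\beta_t\), and \(\gamma_t\) are time-dependent scaling functions. Under scaling conditions on these functions described in [4], the Euler-Lagrange equations of this Lagrangian recover gradient flow, the accelerated-GD equation above, and a continuum of related dynamics.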

The results that Jordan presented during this portion of his talk went well beyond this concept and included approaches for obtaining rates, backward error results, and explicit schemes via the power of numerical analysis, symplectic geometry, differential manifolds, Hamiltonians, and dissipative systems.

Machine Learning and Chemical Engineering

Jordan had quite a lot to say about his vision of ML as a budding engineering discipline, a development he likens to the historical emergence of chemical engineering from chemistry. His first observation pertained to recommendation systems, wherein a machine recommends one of two or more choices to a user. However, the ideal approach to recommendation varies across applications. Suppose the system recommends the “best” option to every user. If it recommends the “best” driving route from point A to point B or the “best” restaurant to everyone, issues such as congestion and overcrowding will quickly arise. Jordan argues that microeconomics will play an increasing role in these types of settings. Traditional markets, while not a direct algorithm and not without their own issues, are adaptive, robust, and scalable in a way that already handles these difficulties; he therefore believes that researchers should consider them as alternatives to artificial intelligence (AI).

Jordan was also concerned with the increasingly common use of AI-driven decisions in medicine. Suppose that a doctor feeds patient data into an AI model that is trained to detect a disease and receives a numerical score of 0.71, just above the 0.70 threshold for positive detection. Should the patient receive treatment? Jordan contends that one should not use such thresholds in isolation. Instead, the doctor should account for factors related to the data on which the algorithm was trained. Frameworks must also be established so that practitioners can quickly understand how and why the AI system in question made its decision, including error bars, the applicability of the study data that drives the decision, and so on.

Jordan classified the past decade of ML as a “pattern recognition” era, with a focus on tasks like speech recognition, computer vision, and natural language processing. Such algorithms significantly impact billions of people around the world. The applications to which Jordan alludes, as well as others in the area of pattern recognition, are the most visible in our daily lives. However, algorithms that handle loan and job applications, predict recidivism, and so forth arguably have equal, if not longer-lasting, impacts on society. As humanity strives toward an increasingly just and equitable future, holistic methods will likely remain an ongoing research priority.


This article is based on Michael Jordan’s invited talk at the 2020 SIAM Conference on Mathematics of Data Science (MDS20), which occurred virtually earlier this year. Jordan’s presentation is available on SIAM’s YouTube Channel.

References
[1] Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., & Jordan, M.I. (2017). How to escape saddle points efficiently. Preprint, arXiv:1703.00887.
[2] Nesterov, Y. (1998). Introductory lectures on convex programming. Volume I: Basic course. Lecture notes. 
[3] Su, W., Boyd, S., & Candès, E.J. (2014). A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in neural information processing systems 27 (NIPS 2014) (pp. 2510-2518). Red Hook, NY: Curran Associates, Inc.
[4] Wibisono, A., Wilson, A.C., & Jordan, M.I. (2016). A variational perspective on accelerated methods in optimization. PNAS, 113(47), E7351-E7358.

Manuchehr Aminian is an assistant professor in the Department of Mathematics and Statistics at California State Polytechnic University, Pomona. His interests include mathematical modeling, visualization, and mathematical methods in data science.
