# Optimization in Machine Learning and Data Science

Machine learning (ML) and artificial intelligence (AI) have burst into public consciousness in the last several years. While large language and multimodal models like GPT-4 have recently taken the excitement to a new level, developments in voice recognition software, novel recommendation systems for online retailers and streaming services, superhuman-level play by computers in Chess and Go, and unfulfilled promises in technologies like self-driving cars have been generating interest for more than a decade. Many research disciplines are feeling the profound effects of AI. For example, scientists can now utilize neural networks (NNs) to predict a protein’s structure based on its amino acid sequence [3] — a problem that was identified decades ago as a grand challenge for computational science.

ML, AI, data science, data analysis, data mining, and statistical inference all have different but overlapping meanings; the term “data science” is perhaps the most general. The remainder of this article will refer to ML since the problems that we discuss tend to cluster in that area.

Modern ML rests on multidisciplinary foundations in statistics, mathematics, and computer science. Optimization plays a central role by providing tools that formulate and solve computational problems in ML. Optimization *formulations* encapsulate statistical principles and distill them into computational problems that are solvable with algorithms and software. They also enable tradeoffs between competing goals via the judicious use of constraints or weighting of terms in the objective. Additionally, recent years have seen a surge of research around optimization *algorithms* that are suited to the structure, scale, and context of ML applications — often building upon long-established foundations with new perspectives and insights.

**Figure 1.** The ImageNet dataset contains more than a million photos of one thousand objects. Figure courtesy of the ImageNet database at Princeton University and Stanford University.

After preprocessing, the \(m\) items of data that are presented to an ML system each consist of a vector of *features* \(a_j\), \(j=1,2,...,m\), and a *label* \(y_j\) that is associated with each feature vector. The fundamental task is to learn a function \(\phi\) that (approximately) maps each \(a_j\) to its associated \(y_j\). The process of determining \(\phi\) is often known as *training*, and the data \((a_j, y_j)\), \(j=1,2,...,m\), are called *training data*. Usually, \(\phi\) is parametrized by some finite vector \(x\) and the training problem reduces to finding \(x\) such that \(\phi(a_j;x)\approx y_j\) for all \(j=1,2,...,m\); in short, it becomes a data-fitting problem. In a NN, \(x\) can be the vector of weights on all of the connections in the network — a vector that is believed to have more than a trillion components in the GPT-4 NN.

To formulate the data-fitting problem as an optimization problem, we define an objective \(f(x)=\frac{1}{m}\sum^m_{j=1}\ell(\phi(a_j;x),y_j)\), where the *loss function* \(\ell\) measures the discrepancy between the prediction \(\phi(a_j;x)\) and the true label \(y_j\) on a single data point. This “finite-sum” form of the objective is typical of optimization problems in ML.
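As a concrete illustration, the finite-sum objective takes only a few lines of NumPy. The linear model and squared loss below are illustrative choices (the function names are ours, not from any particular library):

```python
import numpy as np

def finite_sum_objective(x, A, y, loss):
    """f(x) = (1/m) * sum_j loss(phi(a_j; x), y_j) for a linear model phi(a; x) = a @ x.

    A is an (m, n) matrix whose rows are the feature vectors a_j;
    y is the length-m vector of labels.
    """
    predictions = A @ x                 # phi(a_j; x) for all j at once
    m = len(y)
    return sum(loss(p, t) for p, t in zip(predictions, y)) / m

def squared_loss(pred, label):
    """One common choice of loss: the squared error on a single data point."""
    return (pred - label) ** 2
```

Any other loss \(\ell\), or any other parametrized model \(\phi\), slots into the same finite-sum skeleton.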

Besides fitting the training data, the mapping \(\phi\) often needs to satisfy certain additional properties. Most importantly, it should *generalize* to other data that are similar to the training data. If we present \(\phi\) with a feature vector \(a_k\) that is drawn from the same distribution as the training data (but is not identical to any particular training item), the mapping should make a good prediction of the label \(y_k\). We might also require prior knowledge about \(\phi\) to be incorporated into the optimization formulation, alongside the knowledge that is represented by the training data. The addition of *regularization* terms to the objective \(f\) can capture these and other desirable properties. For example, we can add a term \(\lambda\|{x}\|^2_2\) for some parameter \(\lambda>0\) to suppress the size of \(x\) and (loosely speaking) find a solution that is less sensitive to certain types of noise in the data.

This general framework encompasses a wide range of paradigms in ML. When \(y_j\) is a real number, we can point to the special case of classical linear least-squares regression, where \(\phi(a_j;x)=a_j^T x\) and \(\ell(\phi(a_j;x),y_j)=(a^T_j x-y_j)^2\). Generalizations of least-squares regression abound. The addition of regularization terms such as \(\lambda\|x\|^2_2\) or \(\lambda\|x\|_1\) induces certain properties in \(x\), such as sparsity in the case of \(\lambda\|x\|_1\). We can replace the square loss function \(\ell\) with alternatives like absolute values or other robust losses, or swap the simple linear parametrization of \(\phi\) with a NN — a tactic that typically generates a weight vector \(x\) that has many more components and leads to a more powerful class of functions \(\phi\).
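For instance, the \(\lambda\|x\|^2_2\)-regularized least-squares (ridge) problem has a closed-form solution via the normal equations. The sketch below is our own minimal implementation, not a library routine:

```python
import numpy as np

def ridge_regression(A, y, lam):
    """Minimize (1/m)||A x - y||^2 + lam * ||x||^2.

    Setting the gradient to zero gives the linear system
    (A^T A / m + lam * I) x = A^T y / m, which we solve directly.
    """
    m, n = A.shape
    return np.linalg.solve(A.T @ A / m + lam * np.eye(n), A.T @ y / m)
```

With `lam=0` this reduces to ordinary least squares; larger values of `lam` shrink the solution toward the origin, illustrating the tradeoff between data fit and regularization.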

When \(y_j\) is a label from a finite set \(\{1,2,...,C\}\)—as in an image classification scenario—the problem is one of *multiclass classification*. The logistic loss (also known as cross entropy) is the most common objective function for this type of problem, though least-squares losses have also been proposed.
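The logistic loss for a single example can be sketched as follows; the log-sum-exp shift is the standard trick to avoid numerical overflow (the function name is ours):

```python
import numpy as np

def cross_entropy_loss(logits, label):
    """Logistic (cross-entropy) loss for one training example.

    logits: length-C vector of scores phi(a; x), one per class.
    label:  true class index in {0, ..., C-1}.
    """
    shifted = logits - np.max(logits)                 # stabilize before exponentiating
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]                          # negative log of the true class probability
```

The loss is near zero when the true class dominates the scores, and grows without bound as the model becomes confidently wrong.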

Low-rank matrix problems form another important class of modern ML challenges. In our framework, \(x\) parametrizes a low-rank matrix and \((a_j, y_j)\) captures certain observations about the matrix, such as a single element or a random linear combination of elements. Low-rank matrices appear in recommendation systems like the famous Netflix Prize problem, which sought to predict the rating that every user would give to every movie on the streaming platform based on a relatively small sample of ratings from each user. The contest revealed that a linear combination of the preferences of several archetypal movie fans expresses each individual’s taste fairly accurately, meaning that the matrix of preferences is approximately low rank.
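To make the low-rank formulation concrete, here is a toy matrix-completion sketch: plain gradient descent on the factors of \(UV^T\) over the observed entries. This is our own minimal illustration under simple assumptions (small problem, fixed step length); production recommendation systems use far more sophisticated solvers:

```python
import numpy as np

def factorize(observed, shape, rank, steps=2000, lr=0.05, seed=0):
    """Fit a rank-r factorization M ~= U @ V.T from a few observed entries.

    observed: list of (i, j, value) triples, the known matrix entries.
    Runs gradient descent on the squared error over the observed entries only.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = shape
    U = 0.1 * rng.standard_normal((n_rows, rank))   # small random initialization
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(steps):
        for i, j, val in observed:
            err = U[i] @ V[j] - val                  # residual on one observed entry
            # simultaneous gradient step on both factors
            U[i], V[j] = U[i] - lr * err * V[j], V[j] - lr * err * U[i]
    return U, V
```

Once the factors fit the observed entries, the product \(U V^T\) predicts the unobserved ones — exactly the Netflix Prize setting in miniature.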

The label \(y_j\) is not present in certain other ML problems, but it still may be interesting to identify structures in the collection of feature vectors. For example, we might find that all \(a_j\) lie near a low-dimensional manifold, or that the \(a_j\) cluster into a few distinct regions. This structure may reveal some useful information about the data and can even enable prediction. For instance, when a new feature vector \(a_k\) lies in a certain cluster, it presumably shares properties with the other feature vectors within that cluster.
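One classical way to find such clusters is Lloyd's k-means algorithm; the bare-bones sketch below alternates between assigning points to their nearest center and moving each center to the mean of its points (initialization and iteration count are illustrative choices):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm for clustering the feature vectors a_j.

    points: (m, n) array of feature vectors. Returns the cluster centers
    and the index of the center assigned to each point.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return centers, labels
```

A new feature vector can then be assigned to its nearest center, which is the simplest form of the prediction-by-cluster idea described above.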

What optimization algorithms do researchers use to solve problems that are formulated in the ways described above, particularly to minimize the functions with “finite-sum” form? The most common algorithms for smooth functions are based on gradient descent, which takes steps in a direction that approximates the objective’s negative gradient. The most important algorithm of this type is *stochastic gradient*. It is commonly known as SGD where the “D” stands for “descent,” though this algorithm may not reduce the function value at each iteration. SGD uses the gradient approximation

\[\frac{1}{|\mathcal{B}|}\sum_{j\in\mathcal{B}}\nabla_x\ell(\phi(a_j;x),y_j) \approx \nabla f(x)=\frac{1}{m}\sum^m_{j=1}\nabla_x\ell(\phi(a_j;x),y_j) \]

for some random subset \(\mathcal{B} \subset \{1,2,...,m\}\). While the full gradient requires access to the entire training set (which is impractical in most contexts), evaluation of the approximate gradient is much cheaper when \(|\mathcal{B}| \ll m\) (as is typically the case).
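In code, each SGD step touches only the \(|\mathcal{B}|\) terms in the minibatch, never the full sum. A minimal sketch (the batch size, step length, and the simple one-dimensional test problem are illustrative choices of ours):

```python
import numpy as np

def sgd(grad_j, x0, m, batch_size=8, lr=0.1, steps=100, seed=0):
    """Minibatch stochastic gradient for f(x) = (1/m) * sum_j loss_j(x).

    grad_j(x, j) returns the gradient of the j-th loss term at x; each
    step evaluates only batch_size such terms.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        batch = rng.choice(m, size=batch_size, replace=False)   # random subset B
        g = sum(grad_j(x, j) for j in batch) / batch_size       # averaged minibatch gradient
        x -= lr * g
    return x
```

The iterates fluctuate around the minimizer rather than converging exactly, which is why diminishing step lengths or averaging are used in practice.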

One can enhance the gradient and stochastic gradient methods in many ways. A proximal-gradient approach explicitly handles any nonsmooth regularization terms in the objective. The use of *momentum*—for which the current search direction depends on the search directions at many previous iterations, rather than just the gradient at the current point—can improve performance in both theory and practice. The idea of momentum harks back to conjugate gradient methods in the 1950s, but the touchstone reference in the convex nonlinear setting is Yurii Nesterov’s famous paper [5]. In addition, element-wise scaling of the search direction vector based on accumulated gradient information can enhance performance in practice; NN training employs the Adam approach almost universally [4].
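The momentum idea amounts to a two-line change to gradient descent. In the heavy-ball sketch below, the search direction \(v\) accumulates all past gradients with geometrically decaying weights; Adam additionally rescales each component of the direction, which we omit here:

```python
import numpy as np

def momentum_descent(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with (heavy-ball) momentum.

    The direction v blends the current gradient with earlier directions,
    so each step depends on the whole gradient history, not just grad(x).
    """
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)   # accumulate gradient history
        x = x - lr * v
    return x
```

Setting `beta=0` recovers plain gradient descent; values near one give the long "memory" that accelerates progress along consistent descent directions.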

Although NN training is the 800-pound gorilla of ML computations, a plethora of other ML problems utilize optimization formulations and algorithms in novel ways. Techniques from robust and distributionally robust optimization can handle various types of data uncertainty or improve the solutions’ resistance to adversarial actions. Conditional gradient or Frank-Wolfe methods are beneficial in many constrained nonlinear problems for which one can cheaply minimize a *linear* function over the constraint set. Mirror descent is applicable when the constraint set is a probability simplex, as is the case in many ML applications. Kernel learning, which was the dominant paradigm before the NN juggernaut arrived, made ingenious use of convex programming duality [1]. Coordinate descent and the alternating direction method of multipliers are relevant to many applications, including those that involve matrices. Quasi-Newton and Newton approaches have not yet found widespread application, but investigations continue. Further details about some of these techniques are available in the literature [2, 6].
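As one example from this list, a Frank-Wolfe sketch on the probability simplex: minimizing a linear function over the simplex just picks the vertex \(e_i\) with the smallest gradient component, so the per-step cost is a single `argmin`. The \(2/(t+2)\) step-size schedule is the standard textbook choice; the function name is ours:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, steps=500):
    """Frank-Wolfe (conditional gradient) over the probability simplex.

    Each iterate is a convex combination of simplex vertices, so
    feasibility (nonnegative entries summing to one) is automatic.
    """
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # linear subproblem: best vertex
        gamma = 2.0 / (t + 2)            # standard step-size schedule
        x = (1 - gamma) * x + gamma * s
    return x
```

No projection is ever needed, which is precisely the appeal of conditional gradient methods when the linear subproblem is cheap.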

ML has influenced optimization in several fundamental ways beyond the challenges that arise from the scale and structure of its applications. For instance, the phenomenon of “benign nonconvexity” has emerged over the past 10 years. Finding the global minimum of a nonconvex function traditionally seemed impossible; the most that one could hope for was a local solution. But in many ML-based nonconvex optimization scenarios, finding a global minimizer is tractable from both a theoretical and practical perspective.^{1} In some problems, it is easy to identify an initial point that is close to the global solution. In other problems, all local solutions are in fact global solutions. In still other problems, all stationary points (points with a zero gradient) are either global solutions or strict saddle points that are easily avoided.

Convergence to global minimizers often occurs in the training of overparametrized NNs, wherein the number of weights (length of vector \(x\)) far exceeds the number of items \(m\) in the training set. Given the nonconvex, nonsmooth nature of the objective function and the relatively naive strategy employed by the SGD algorithm, this development was certainly unexpected. Although researchers have explored several perspectives on this fascinating phenomenon, none have yet provided a complete explanation.

ML has also influenced analysis styles for optimization methods. Complexity analysis—worst-case bounds on the amount of computation that is required to find approximate solutions—has become much more prominent, while the analysis of local convergence properties is less of a focal point. Convexity assumptions were prevalent for many years, but this paradigm has since changed due to the inherent nonconvexity of many applications.

Optimization’s engagement with data science and ML has benefited all of these research fields. Fresh perspectives and demands from ML applications have challenged traditional optimization approaches and modes of research. The influx of researchers with ML backgrounds who are collaborating with optimizers or performing their own fundamental work in optimization has revitalized the optimization community. New issues and paradigms continue to arise, and we look forward to many more years of exciting developments.

^{1}A list of such problems is maintained here.

**References**

[1] Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In *Proceedings of the fifth annual workshop on computational learning theory (COLT’92)* (pp. 144-152). Pittsburgh, PA: Association for Computing Machinery.

[2] Bottou, L., Curtis, F.E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. *SIAM Rev.*, *60*(2), 223-311.

[3] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. *Nature*, *596*, 583-589.

[4] Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. Preprint, *arXiv:1412.6980*.

[5] Nesterov, Y.E. (1983). A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). *Soviet Math. Dokl.*, *27*(2), 372-376.

[6] Sra, S., Nowozin, S., & Wright, S.J. (Eds.) (2011). *Optimization for machine learning*. Cambridge, MA: The MIT Press.

[7] Vapnik, V.N. (1999). *The nature of statistical learning theory*. New York, NY: Springer Science & Business Media.

Stephen J. Wright is a professor of computer sciences at the University of Wisconsin-Madison.