
New Bridges Between Deep Learning and Partial Differential Equations

By Lars Ruthotto

Understanding the world through data and computation has always formed the core of scientific discovery. Amid many different approaches, two common paradigms have emerged. On the one hand, primarily data-driven approaches—such as deep neural networks (DNNs)—have proven extremely successful in recent years. Their success is based mainly on their ability to approximate complicated functions with generic models when trained using vast amounts of data and enormous computational resources. But despite their many triumphs, DNNs are difficult to analyze and thus remain mysterious. Most importantly, they lack the robustness, explainability, interpretability, and fairness required for high-stakes decision-making. On the other hand, increasingly realistic model-based approaches—typically derived from first principles and formulated as partial differential equations (PDEs)—are now available for various tasks. One can often calibrate these models—which enable detailed theoretical studies, analysis, and interpretation—with relatively few measurements, thus facilitating accurate predictions of phenomena. However, computational methods for PDEs remain a vibrant research area whose open challenges include the efficient solution of highly nonlinear coupled systems and of PDEs in high dimensions.

In recent years, exciting work at the interface of data-driven and model-based approaches has blended both paradigms. For instance, PDE techniques and models have yielded better insight into deep learning algorithms, more robust networks, and more efficient training algorithms. As another example, consider the solution of high-dimensional PDEs, wherein DNNs have provided new avenues for tackling the curse of dimensionality. One must understand that the exchange between deep learning and PDEs is bidirectional and benefits both communities. I hope to offer a glimpse into a few of these activities and make a case for the creation of new bridges between applied mathematics and data science.

Continuous Neural Networks Motivated by Ordinary and Partial Differential Equations

Researchers have traditionally constructed DNNs by concatenating a small, finite number of functions, each consisting of a trainable affine mapping and a pointwise nonlinearity [9, 13]. Because the difficulty of initializing and training the network weights increases with the number of layers, the network’s depth has been limited in practice. However, the arrival of so-called residual neural networks (ResNets) in 2016—which outperformed traditional networks across a variety of tasks—dramatically changed this situation.

For a simple example of a ResNet in action, consider the training of a neural network that classifies points in \(\mathbb{R}^2\) into two classes based on training data \(\{(\textbf{y}^{(1)},c^{(1)}), (\textbf{y}^{(2)},c^{(2)}),\ldots \} \subset \mathbb{R}^2 \times \{0,1\}\). We have plotted an instance of this scenario in Figure 1a. The deep learning approach to this problem consists of two stages. We first transform the feature space (possibly increasing its dimension) via a neural network. Next, we employ a simple classification model, such as linear multinomial regression. By utilizing a ResNet with \(N\) layers for the first step, we transform the data point \(\textbf{y}\) into \(\textbf{u}_N\) as follows:

\[\textbf{u}_0 = \textbf{K}_{\rm in} \textbf{y}\]

\[\textbf{u}_1 = \textbf{u}_0 + h \ \sigma(\textbf{K}_0 \textbf{u}_0 + \textbf{b}_0)\]

\[\vdots \enspace = \quad \vdots\]

\[\textbf{u}_N = \textbf{u}_{N-1} + h \ \sigma(\textbf{K}_{N-1} \textbf{u}_{N-1} + \textbf{b}_{N-1}).\]

Here, \(\sigma(x)=\tanh(x)\) serves as an activation function that is applied element-wise, \(h>0\) is a fixed step size, and \(\textbf{K}_{\rm in} \in \mathbb{R}^{3\times 2}\), \(\textbf{K}_0, \ldots, \textbf{K}_{N-1}\in\mathbb{R}^{3\times 3}\), and \(\textbf{b}_0, \ldots, \textbf{b}_{N-1}\in\mathbb{R}^3\) are the trainable weights. Figure 1b depicts the ResNet’s action for the learned weights, which we determined through optimization. Based on the projections of the transformed points \(\textbf{u}_N\) onto their first two dimensions, it is apparent that solving the classification problem with a linear model has become trivial. Figure 1c displays the trained classifier in the original data space.
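To make this recursion concrete, here is a minimal NumPy sketch that propagates a small batch of points through such a network; the number of layers, the step size, and the random weights are illustrative placeholders rather than trained values.

```python
# A minimal sketch of the ResNet feature transformation described above.
# The weight shapes follow the text (K_in in R^{3x2}, K_j in R^{3x3}, b_j in R^3);
# the random initialization and the step size h are illustrative, not trained values.
import numpy as np

def resnet_forward(y, K_in, K_list, b_list, h=0.1):
    """Propagate a batch of 2D points y (shape: 2 x batch) through the ResNet."""
    u = K_in @ y                      # opening layer: u_0 = K_in y
    for K, b in zip(K_list, b_list):  # residual layers: u_{j+1} = u_j + h*sigma(K u_j + b_j)
        u = u + h * np.tanh(K @ u + b[:, None])
    return u                          # transformed features u_N (shape: 3 x batch)

# Example with N = 10 untrained (random) layers.
rng = np.random.default_rng(0)
N = 10
K_in = rng.standard_normal((3, 2))
K_list = [rng.standard_normal((3, 3)) for _ in range(N)]
b_list = [rng.standard_normal(3) for _ in range(N)]
y = rng.standard_normal((2, 5))       # five random 2D input points
u_N = resnet_forward(y, K_in, K_list, b_list)
print(u_N.shape)                      # (3, 5)
```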

Figure 1. Binary classification via a deep residual neural network. 1a. A synthetic dataset consisting of concentric ellipsoids in two dimensions that are labeled into two classes, which are visualized as blue and red points. 1b. The propagated input features for the trained neural network. When trained successfully, the propagated features can be classified with a linear model. We visualize the decision boundary with a black line and the model’s prediction with a background that is colored according to the predicted class. 1c. The classifier’s prediction in the original data space. Figure courtesy of Lars Ruthotto.

One can also interpret the transformed features \(\textbf{u}_N\) as a forward Euler approximation of \(\textbf{u}(T)\), where \(\textbf{u}\) satisfies the initial value problem

\[\partial_t \textbf{u}(t) = \sigma(\textbf{K}(t) \textbf{u}(t) + \textbf{b}(t)),\qquad t\in(0,T], \qquad \textbf{u}(0) = \textbf{K}_{\rm in} \textbf{y}.\]

Here, \(T>0\) is an artificial final time that is loosely related to the network’s depth [4, 8].
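Indeed, one forward Euler step of size \(h=T/N\) applied to this initial value problem reads

\[\textbf{u}_{j+1} = \textbf{u}_j + h\, \sigma\big(\textbf{K}(jh)\, \textbf{u}_j + \textbf{b}(jh)\big), \qquad j = 0, \ldots, N-1,\]

which coincides with the residual layers above once we identify \(\textbf{K}_j\) with \(\textbf{K}(jh)\) and \(\textbf{b}_j\) with \(\textbf{b}(jh)\).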

This continuous viewpoint has been popularized in the machine learning community under the term “neural ordinary differential equations” (ODEs) [3]; however, similar ideas had been published earlier [6]. Scientists have recently been applying ODE techniques to create faster, better-understood algorithms for neural networks. For instance, we have proposed new architectures that lead to more stable ODE dynamics [8]. Furthermore, since one might view training as a (stochastic) optimal control problem, efficient solvers for the learning problem (as well as insight into this problem) have resulted from adapted computational science and engineering methods [5, 7].
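Schematically, and introducing for illustration a loss function \(\ell\), a linear classification model \(\textbf{W}\), and \(s\) training samples (notation that is assumed here, not taken from the example above), the optimal control viewpoint writes training as

\[\min_{\textbf{K}(\cdot),\,\textbf{b}(\cdot),\,\textbf{W}} \; \frac{1}{s}\sum_{i=1}^{s} \ell\big(\textbf{W}\, \textbf{u}^{(i)}(T),\, c^{(i)}\big) \quad \text{subject to} \quad \partial_t \textbf{u}^{(i)}(t) = \sigma\big(\textbf{K}(t)\, \textbf{u}^{(i)}(t) + \textbf{b}(t)\big), \quad \textbf{u}^{(i)}(0) = \textbf{K}_{\rm in} \textbf{y}^{(i)},\]

where the time-dependent weights play the role of controls and a regularization term is commonly added.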

Analysis of high-dimensional datasets like speech, image, and video data has been a significant focal point for the deep learning community. In fact, deep learning’s breakthroughs in speech and image recognition roughly a decade ago are partly responsible for renewed interest in the subject. Nevertheless, some challenges remain difficult or beyond reach. These ongoing problems include controlling a self-driving car based on predictions made from high-resolution images of street scenes, and reliably computing the volume fraction of COVID-19-affected lung tissue in three-dimensional computed tomography images [10].

While such theoretical and computational challenges may seem insurmountable, we can turn to the field of PDE-based imaging for inspiration. In the last several decades, researchers have created many celebrated algorithms by interpreting image data as discretized functions that can be processed via PDE or integral operators. One can also apply this viewpoint to deep learning with convolutional neural networks whose operators are linear combinations of PDE operators [6].
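As a minimal illustration of this connection (in one spatial dimension, with an assumed pixel size \(h_x\) and placeholder coefficients), a three-point convolution stencil can be written as a trainable linear combination of an identity, a central first-derivative, and a second-derivative stencil:

```python
# Sketch: a 1D convolution kernel expressed as a linear combination of
# finite-difference stencils (identity, first derivative, second derivative).
# The pixel size hx and the coefficients beta are illustrative placeholders.
import numpy as np

hx = 0.5
identity   = np.array([0.0, 1.0, 0.0])
first_der  = np.array([-1.0, 0.0, 1.0]) / (2 * hx)
second_der = np.array([1.0, -2.0, 1.0]) / hx**2

beta = np.array([0.3, -1.2, 0.8])  # trainable coefficients (placeholder values)
stencil = beta[0] * identity + beta[1] * first_der + beta[2] * second_der
print(stencil)  # the equivalent standard three-point convolution kernel
```

Training the coefficients of the differential operators, rather than the raw stencil entries, is what ties the learned convolution weights to PDE operators and their theory.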

We have used this observation to extend the neural ODE framework to PDEs and create new types of networks. Specifically, we adapted residual neural networks to form unique models that inherit the stability of parabolic PDEs or—upon suitable discretization—lead to reversible hyperbolic networks [11]. The latter can help overcome memory limitations of current computing hardware. For instance, we trained a hyperbolic network with more than 1,200 layers to classify images on a single graphics processing unit [2].
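For intuition, here is a hedged sketch of one such reversible layer: a leapfrog discretization of second-order (hyperbolic) dynamics. The layer function, step size, and weights below are placeholders, not the architecture used in [2].

```python
# Sketch of a reversible "hyperbolic" residual step in the spirit of [2, 11]:
# a leapfrog discretization of a second-order ODE. The layer function f and
# the step size h are illustrative placeholders.
import numpy as np

def f(u, K, b):
    return np.tanh(K @ u + b[:, None])

def leapfrog_forward(u_prev, u_curr, K, b, h=0.1):
    """One forward leapfrog step: u_{j+1} = 2 u_j - u_{j-1} + h^2 f(u_j)."""
    return 2 * u_curr - u_prev + h**2 * f(u_curr, K, b)

def leapfrog_backward(u_curr, u_next, K, b, h=0.1):
    """Recover u_{j-1} from (u_j, u_{j+1}): the step is exactly reversible."""
    return 2 * u_curr - u_next + h**2 * f(u_curr, K, b)

# Check reversibility on random data.
rng = np.random.default_rng(1)
K, b = rng.standard_normal((3, 3)), rng.standard_normal(3)
u0, u1 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
u2 = leapfrog_forward(u0, u1, K, b)
print(np.allclose(leapfrog_backward(u1, u2, K, b), u0))  # True
```

Because earlier states can be recomputed from later ones during backpropagation, intermediate activations need not be stored, which is the memory saving that makes very deep networks feasible on a single device.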

Deep Learning for the Solution of High-Dimensional PDEs

With few exceptions, the numerical solution of high-dimensional PDEs is challenging due to the curse of dimensionality. As a simple example, consider a finite difference method for the solution of Poisson’s problem on a \(d\)-dimensional rectangular grid with \(n\) cells in each dimension. This approach quickly becomes prohibitive as \(d\) grows, since the mesh consists of \(n^d\) cells. The exponential growth of computational costs prohibits the application of the finite difference method—and other methods that rely on grids—to high-dimensional problems that arise in areas like statistics, finance, and economics. To avoid this growth, one can utilize a neural network to parameterize the PDE solution and rely on the network’s universal approximation properties. While the concept itself is not especially novel, deep learning advances—particularly new architectures, improved theoretical results, optimization algorithms, and easy-to-use software packages—have enabled several impressive outcomes.
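A quick back-of-the-envelope computation (assuming \(n=100\) cells per dimension and 8 bytes per stored value, numbers chosen only for illustration) shows how fast the memory requirement alone becomes infeasible:

```python
# Illustrative only: number of grid cells and memory needed to store one
# double-precision value per cell of a regular grid with n cells per dimension.
n = 100
for d in (1, 2, 3, 6, 10):
    cells = n ** d
    print(f"d = {d:2d}: {cells:.1e} cells, about {8 * cells / 1e9:.1e} GB")
```

Already at \(d=10\), storing a single value per cell would require on the order of \(10^{12}\) gigabytes, whereas a mesh-free network parameterization only needs to store its weights.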

One such example is the application of neural networks to high-dimensional mean field games [12]. Mean field games arise in multiple applications [1]. Their solution is characterized by the value function that satisfies a PDE system, which couples the continuity equation and the Hamilton-Jacobi-Bellman (HJB) equation. Computing the value function is extremely challenging due to the forward-backward structure, the HJB equation’s nonlinearity, and the high dimensionality. Our approach employs a neural network that is specifically designed to allow a mesh-free solution of the continuity equation via a Lagrangian method. Although more analysis is needed to fully understand the stochastic non-convex optimization problem that trains the network, our initial results indicate that neural networks can compete with well-understood, mesh-based methods in two dimensions while also being scalable to 100 dimensions.
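For reference, in one common deterministic formulation (the notation here, with value function \(\Phi\), density \(\rho\), Hamiltonian \(H\), interaction term \(f\), and final time \(T\), is introduced only for illustration), the coupled system reads

\[-\partial_t \Phi(\textbf{x},t) + H\big(\textbf{x}, \nabla \Phi(\textbf{x},t)\big) = f\big(\textbf{x}, \rho(\textbf{x},t)\big),\]

\[\partial_t \rho(\textbf{x},t) - \nabla \cdot \Big(\rho(\textbf{x},t)\, \nabla_{\textbf{p}} H\big(\textbf{x}, \nabla \Phi(\textbf{x},t)\big)\Big) = 0,\]

where the density is prescribed at the initial time and the value function at the final time \(T\); these opposite time directions are the forward-backward structure mentioned above.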

Outlook

In this article, I have aimed to provide a glimpse into the exciting activities and opportunities at the interface of deep learning and applied mathematics. To demonstrate that the exchange is not a one-way street, I have also discussed the promise of deep learning for problems that have remained difficult or out of reach in applied mathematics, particularly the numerical solution of high-dimensional PDEs.

The coming years will almost certainly see SIAM and its members drive advances in these areas. Given the widespread use of deep learning in real-world applications, one can perhaps expect the biggest impact to stem from mathematical theory—including numerical analysis—that aims to obtain reliable, interpretable, fair, and efficient machine learning models. Such theory would also enable deep learning in scientific applications where current results suggest significant potential but where open issues remain, such as convergence guarantees and uncertainty quantification. Finally, fusing data-driven and model-based approaches is a promising means of compensating for the lack of first-principles models with data in the form of measurements, observations, and simulations.


This article is based on Lars Ruthotto’s invited talk at the 2020 SIAM Annual Meeting, which took place virtually last July. Ruthotto’s presentation is available on SIAM’s YouTube Channel.

References
[1] Caines, P.E. (2020, April 1). Mean field game theory: A tractable methodology for large population problems. SIAM News, 53(3), p. 5.
[2] Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., & Holtham, E. (2018). Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (pp. 2811-2818). New Orleans, LA.
[3] Chen, T.Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018). Montreal, Canada.
[4] E, W. (2017). A proposal on machine learning via dynamical systems. Comm. Math. Stat., 5(1), 1-11.
[5] Gholami, A., Keutzer, K., & Biros, G. (2019). ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. Preprint, arXiv:1902.10298.
[6] González-García, R., Rico-Martínez, R., & Kevrekidis, I.G. (1998). Identification of distributed parameter systems: A neural net based approach. Comp. Chem. Eng., 22, S965-S968.
[7] Günther, S., Ruthotto, L., Schroder, J.B., Cyr, E.C., & Gauger, N.R. (2020). Layer-parallel training of deep residual neural networks. SIAM J. Math. Data Sci., 2(1), 1-23.
[8] Haber, E., & Ruthotto, L. (2017). Stable architectures for deep neural networks. Inverse Prob., 34(1), 1-22.
[9] Higham, C.F., & Higham, D. (2019). Deep learning: An introduction for applied mathematicians. SIAM Rev., 61(4), 860-891.
[10] Lensink, K., Parker, W., & Haber, E. (2020, July 13). Deep learning for COVID-19 diagnosis. SIAM News, 53(6), p. 1.
[11] Ruthotto, L., & Haber, E. (2020). Deep neural networks motivated by partial differential equations. J. Math. Imag. Vision, 62(3), 352-364.
[12] Ruthotto, L., Osher, S.J., Li, W., Nurbekyan, L., & Wu Fung, S. (2020). A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proc. Natl. Acad. Sci., 117(17), 9183-9193.
[13] Strang, G. (2018, December 3). The functions of deep learning. SIAM News, 51(10), p. 1.

Lars Ruthotto is an applied mathematician who develops computational methods for machine learning and inverse problems. He is an associate professor in the Department of Mathematics and Department of Computer Science at Emory University.
