| February 28, 2017

How Many Dimensions is High Dimensions?

The best way to deal with the curse of dimensionality is to reduce the problem dimension. The SIAM Workshop on Parameter Space Dimension Reduction, happening July 9-10 at the Omni William Penn Hotel in Pittsburgh, Pa., will explore several emerging parameter space dimension reduction techniques.

How many dimensions is high dimensions? The answer, of course, depends on the context – specifically, the notion of dimension in the problem. For physical models in science and engineering, a simulation using all four dimensions of space and time may exceed a laptop’s resources. So four is high. For data sets where each feature of an observation is a dimension, thousands of dimensions might be too much to handle using off-the-shelf spreadsheet software. So thousands is high. A large collection of megapixel resolution images, where each pixel is treated as a dimension, may be too large to load into memory for processing. So millions is high. For linear systems of equations where each unknown is a dimension, anticipated extreme-scale computing platforms promise to solve systems with trillions of dimensions. So trillions is high. Here’s a working characterization of high dimensions: the dimension is so high that algorithms for the associated problem stress available computing resources. For computational and data scientists with pressing questions, the goal is to analyze the data or evaluate the model to find an answer. The dimensionality is not really an issue until it’s high enough to severely restrict or prohibit the processes that answer the pressing question.

Algorithm analysis has a more abstract perspective. Instead of trying to solve one particular problem, it studies how the cost of solving a class of problems changes as the dimension changes. For example, an algorithm for solving a linear system in $d$ dimensions might cost $O(d^3)$ additions and multiplications. The big-$O$ notation is not about an exact quantifiable cost that compares to computing resources. It’s about how the algorithm’s cost scales as the dimension $d$ increases (cubically, in the linear system example). An algorithm for the same linear system whose cost scales like $O(d^2)$ (i.e., quadratically) is faster than the $O(d^3)$ algorithm, since its cost grows more slowly for large enough $d$. The big-$O$ notation hides constants that impact actual costs in particular problems and implementations; the $O(d^3)$ algorithm might be cheaper than the $O(d^2)$ algorithm for a particular problem with fixed $d$ on a particular computer. But as $d$ grows—e.g., if the linear system represents a discrete approximation of a differential equation where increasing $d$ improves the approximation—the scaling eventually kicks in, and the algorithm with better scaling becomes much cheaper.

When an algorithm’s cost scales exponentially with the problem dimension—like $O(n^d)$ for some $n$ related to the algorithm—the algorithm suffers from the curse of dimensionality. Exponential scaling is really, really bad for an algorithm, because it limits the algorithm to problems with fewer than a handful of dimensions on any computing platform. The famous wheat and chessboard fable is a simple illustration of how everyday numbers become unfathomably large when plugged into an exponent. Put one grain of wheat on the first square of a chessboard, two on the second, four on the third, eight on the fourth, and so on. When you reach the far corner of an 8-by-8 chessboard, you’ve placed $2^{64} -1 = 18446744013709551615$ wheat grains on the board. So if the cost of your algorithm scales like $O(2^d)$, then a problem in $64$ dimensions costs about $18.5$ quintillion operations (assuming big-$O$ constants around $1$). If “operations” means floating point operations, then this $64$-dimensional problem takes about $2.5$ minutes on the world’s fastest supercomputer at peak theoretical performance (see the Top500 List). The impact of exponential scaling becomes apparent when the dimension increases; if you double the dimension from $64$ to $128$ (two chess boards), your $O(2^d)$ algorithm suddenly costs about $3 \times 10^{38}$ operations or $116$ trillion years on the same supercomputer – roughly $8,400$ times the age of the universe.

In fact, the situation is much worse than just described. Exponential scaling with dimension may be a fundamental property of the problem, independent of any algorithm. The field of information-based complexity [3] develops formalisms to study how the amount of information required to approximate the solution to a multivariate problem scales with dimension; a problem is intractable if the amount of required information grows exponentially with dimension. Examples of such multivariate problems include integrating, approximating, or optimizing a function of several variables – you know, the sorts of things we need to do all the time in computational and data sciences. In these cases, the dimension is the number of variables, and one piece of information is one function evaluation. Following David Donoho, “The curse of dimensionality is a phrase used by several subfields in the mathematical sciences; I use it here to refer to the apparent intractability of systematically searching through a high-dimensional space, the apparent intractability of accurately approximating a general high-dimensional function, the apparent intractability of integrating a high-dimensional function” [1].

I’m most familiar with two classes of computational science problems where the curse of dimensionality looms large: model-based design and uncertainty quantification. In model-based design, the goal is to find design parameters that optimize the model’s performance metric. If the model is complex (e.g., contains multiple coupled physical components), then it’s hard to know if the performance metric as a function of the design parameters admits structure (e.g., convexity) that lets you skirt the curse of dimensionality. The only way to guarantee that you’ve found the optimal design is, essentially, to check all possible designs. And exhaustive search is one of those algorithms whose cost grows exponentially with the number of design parameters. In uncertainty quantification, one goal is to estimate tail probabilities of model outputs under assumptions about uncertainty in model inputs. Again, in the absence of known and exploitable structure, the only way to guarantee high accuracy in tail probability estimates is to check all possible inputs that are consistent with the input uncertainty assumptions. And the cost of exhaustive checking scales exponentially with the number of uncertain inputs.

How do we deal with the curse in these settings? One common approach is to use a cheaper surrogate model (e.g., a response surface or reduced-order model) in place of the complex model for each evaluation. If it takes $N$ model evaluations to quantify uncertainty or optimize a design, and the surrogate model is ten times cheaper than the original model, then the surrogate-based approach costs $N/10$ – which might be cheap enough to handle with your computer. This may address our working definition of high dimensions from the first paragraph, but it does not address the exponential scaling that is fundamental to the design and uncertainty quantification problems. That number $N$ is hiding something very dangerous – namely, that it grows exponentially with the problem dimension. The surrogate-based approaches fight exponential scaling with algebraic cost reduction; as the dimension increases, the cost of design or uncertainty quantification rapidly exceeds any fixed computational capabilities regardless of how cheap each model evaluation is. (I’ve ignored two important issues: (i) constructing an accurate surrogate is one of those problems whose cost scales exponentially with dimension and (ii) it’s often not clear how to connect surrogate-based design or uncertainty studies to the original model.)

Another idea is to use a more powerful computer. For certain performance measures, computing power has grown exponentially over time thanks to technology advances. But time progresses more slowly than we can add dimensions to the models. If the Department of Energy’s Exascale Computing Project lives up to its promises, it will produce a machine with a thousand times more computing power than the fastest current supercomputers. But a thousand-fold increase in computing power only lets us add two or three more dimensions to the model thanks to exponential scaling. So bigger computers will not cure the curse.

The most effective way to fight the curse of dimensionality is to reduce the problem dimension. Here’s a thought experiment to compare with the surrogate-based approaches. Consider a model with six independent parameters where each evaluation costs $1. To study the model’s behavior over the parameter space, assume we need to evaluate the model on a parameter grid with $10$ points per dimension—a total bill of $1M. Say we develop a surrogate model that is ten times cheaper than the original model, and we deem it sufficiently accurate to study the original model’s parameter dependence. (Equivalently, we could assume we can run $10$ independent models in parallel.) The bill drops to $100k – great savings! If instead we identify exploitable two-dimensional structure in the map from the six parameters to the model output, then we can use a grid with $100$ parameters ($10$ points per reduced parameter). The bill drops to $100 – much greater savings! Dimension reduction, when possible, fights exponential scaling with exponential reduction.

The key qualifier in the previous example is exploitable. There are many techniques that reduce dimension, but each technique’s applicability depends on (i) the problem it’s designed for and (ii) that problem’s notion of dimension. In the previous example, principal component analysis would not find any low-dimensional linear relationships among the six parameters because the parameters are independent. If the model outputs are vector-valued, we could try to reduce the output dimension with the proper orthogonal decomposition. But it’s not clear how the subspace coordinates relate to the input parameters—or how to exploit that relationship (even if we knew it) to reduce the parameter space dimension. The parameter space dimension is the important notion of dimension for the parameter study’s cost.

Here’s a simple example of exploitable dimension reduction for our context—and it’s something that scientists and engineers do instinctively. Of the model’s six parameters, identify the two most important. By important I mean that changes in the parameters make relatively large changes to the model outputs. The field of sensitivity analysis [2] develops both metrics for assessing importance and computational tools for estimating those metrics. Now fix the four unimportant parameters at some nominal value; it shouldn’t matter much where they’re fixed since they’re relatively unimportant. And run the parameter study on the two important parameters. Voila! Dimension reduction! Of course, this approach requires that the model contain a subset of the parameters sufficiently unimportant to ignore for the parameter study. (You might ask, if they aren’t important, why are they in the model?) Moreover, the cost of identifying this subset must be reasonable. If the cost of estimating sensitivity metrics scales exponentially with the parameter space dimension, then we haven’t gained anything. With that said, we have a lot of overhead to work with in models with many input parameters – that is, in high dimensions. The prospect of exponential cost reduction for parameter studies via parameter space dimension reduction should motivate us to pursue all sorts of techniques for identifying exploitable low-dimensional structures.

The SIAM Workshop on Parameter Space Dimension Reduction (DR17) will take place July 9-10 at the Omni William Penn Hotel in Pittsburgh, Pa., coincident with the SIAM Annual Meeting. The workshop will explore several emerging ideas for parameter space dimension reduction, including active subspaces and basis adaptation from uncertainty quantification, inverse regression and sufficient dimension reduction from statistical regression, ridge approximation and recovery from approximation theory, and sloppy models and parameter estimation techniques, along with a range of applications from computational science and engineering. The deadline to submit a two-page abstract for consideration of a 20-minute talk or poster is March 13th. More information can be found at the website.

References
[1] Donoho, D.L. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In AMS Conference on Math Challenges of the 21st Century.
[2] Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., & Tarantola, S. (2008). Global Sensitivity Analysis: The Primer, Hoboken, NJ: John Wiley & Sons.
[3] Traub, J.F., & Werschulz, A.G. (1998). Complexity and Information. Cambridge, England: Cambridge University Press.

Paul Constantine is Ben L. Fryrear Assistant Professor in the Department of Applied Mathematics and Statistics at the Colorado School of Mines. Learn more about Paul on his website.