| December 03, 2018

Randomized Projection Methods in Linear Algebra and Data Analysis

The management and analysis of large complex datasets give rise to mathematical and algorithmic challenges that have intrigued applied mathematicians in recent years. A common situation is a dataset that apparently lives in high-dimensional space, even though evidence suggests that it is interpretable as having a lower “intrinsic” dimension. Finding a way to represent such a dataset using fewer parameters enables its efficient storage, transmission, and analysis to uncover additional structures.

Informally, one might say that the difficulty that arises as a problem’s dimensionality grows is that there are just too many places to look. The unit cube in \(\mathbb{R}^{n}\) has \(2^{n}\) corners, which is a lot even for, say, \(n=100\). Curiously, one’s best bet when facing a vast search space such as this is often to conduct the search completely randomly. I will illustrate how this works in practice by reviewing two recently developed algorithms — one for factorizing matrices and the other for finding approximate nearest neighbors.

Linear Approximation (Principal Component Analysis)

A classical and well-understood dimension reduction problem—given a set of points \(\{a_j\}^n_{j=1}\) in \(\mathbb{R}^m\)—is to find a linear subspace \(L\) that to some degree of approximation contains all points. A common technique for finding \(L\) is to form the \(m\times n\) matrix \(\mathbf{A} = [a_1, a_2,\dots,a_n]\) and compute its singular value decomposition (SVD)

\[ \mathbf{A} = \sum_{j=1}^{p}\sigma_{j}\,\mathbf{u}_j \mathbf{v}_j^*, \]

where \(p = \min(m,n)\), \(\{\sigma_{j}\}_{j=1}^{p}\) are the singular values of \(\mathbf{A}\) (ordered so \(\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{p} \geq 0)\), and \(\{\mathbf{u}_{j}\}_{j=1}^{p}\) and \(\{\mathbf{v}_{j}\}_{j=1}^{p}\) are the left and right singular vectors of \(\mathbf{A}\). For any \(k \in \{1,2,\dots,\,p\}\), the subspace \(L\) spanned by the first \(k\) left singular vectors \(\{\mathbf{u}_{j}\}_{j=1}^{k}\) is in certain regards “optimal.”

The aforementioned problem arises in many applications. An example occurs in statistics, when one attempts to fit observed high-dimensional data to an assumption that such data represent samples drawn from a multivariate normal distribution driven by a small number of significant components.

Randomized SVD

The problem of finding an optimal linear subspace has a clean and simple mathematical solution, but a user must determine how to compute the dominant \(k\) singular values and singular vectors of the given data matrix \(\mathbf{A}\). Standard linear algebra routines, such as those in LAPACK, often return the full decomposition. This computation’s flop count, \(O(mn\min(m,n))\), is typically unaffordable for large matrices. Alternatively, one could compute a partial factorization using an algorithm like column pivoted QR or Arnoldi. While this would reduce the asymptotic flop count to \(O(mnk)\), these algorithms can be difficult to efficiently implement on modern communication-constrained hardware.

Another option, which is often highly competitive, is the recently-developed randomized SVD (RSVD) [1], which draws a Gaussian random matrix \(\mathbf{G}\) of size \(n\times k\) to create a sampling matrix \(\mathbf{Y} = \mathbf{A} \mathbf{G}\). In many situations, the columns of matrix \(\mathbf{Y}\) form a good approximate basis for the column space of \(\mathbf{A}\). To identify approximations to the dominant modes of an SVD of \(\mathbf{A}\), one must orthonormalize \(\mathbf{Y}\)'s columns to find a basis for its column space before projecting the matrix \(\mathbf{A}\) down to this subspace and using deterministic methods to compute a full SVD of the resulting small matrix.

To attain accuracy comparable to the best possible approximation of rank \(k\), the RSVD must draw a few extra samples. To this end, let us select an oversampling parameter \(p\) (\(p = 5\) or \(p = 10\) are generally good) and proceed as follows:

Draw an \(n\times (k+p)\) Gaussian random matrix \(\mathbf{G}\)
Form the \(m \times (k+p)\) sampling matrix \(\mathbf{Y} = \mathbf{A} \mathbf{G}\)
Form an \(m\times (k+p)\) orthonormal matrix \(\mathbf{Q}\), such that \(\mathbf{Y} = \mathbf{Q} \mathbf{R}\)
Form the \((k+p)\times n\) matrix \(\mathbf{B} = \mathbf{Q}^* \mathbf{A}\)
Compute a (small) SVD \(\mathbf{B} = \hat{\mathbf{U}}\:\mathbf{D}\:\mathbf{V}^*\)
Form the matrix \(\mathbf{U}=\mathbf{Q} \: \hat{\mathbf{U}}\).

The result is an approximate factorization \(\mathbf{A} \approx \mathbf{U}\mathbf{D}\mathbf{V}^*\), where the diagonal matrix \(\mathbf{D}\) holds approximations to the singular vectors of \(\mathbf{A}\) and the tall, thin matrices \(\mathbf{U}\) and \(\mathbf{V}\) have orthonormal columns. The RSVD has the same \(O(mnk)\) complexity as existing techniques for partial SVD computation. However, its interaction with the matrix \(\mathbf{A}\) via two matrix-matrix multiplications offers compelling advantages, as it often leads to high practical execution speed. Furthermore, it eases the method’s use in situations where \(\mathbf{A}\) may be stored on slow memory, such as a hard drive, or across a distributed computing system.

The RSVD produces a factorization with near-optimal accuracy in scenarios where \(\mathbf{A}\)'s singular values decay reasonably rapidly. In the case of matrices whose singular values decay slowly, such as measured statistical data, combining the RSVD with a small number of steps of classical subspace iteration is imperative. In practice, we would then replace the computation on the aforementioned line (2) with something along the lines of \(\mathbf{Y} = \mathbf{A} \: (\mathbf{A}^* \: (\mathbf{A}\mathbf{G}))\) [1].

Variations of the RSVD

The basic algorithmic template of the RSVD is easily adaptable to specialized environments. For instance, one can reorganize the computation in such a way that the algorithm only needs to see each entry of the matrix once; this enables the processing of huge matrices resistant to storage [1]. Replacing the Gaussian random matrix with a structured random projection, such as a subsampled and randomized FFT [1], can reduce the flop count from \(O(mnk)\) to \(O(mn\log k)\).

The Johnson-Lindenstrauss Lemma and Applications to Nonlinear Problems

While linear approximation is a tremendously powerful tool, many applications require techniques that are inherently nonlinear. Let us again consider a cloud of points \(\{\mathbf{a}_{j}\}_{j=1}^{n}\) in \(\mathbb{R}^{m}\), where \(m\) is large but information suggests that the points belong to some nonlinear manifold of lower dimension. To further complicate things, we must typically also account for noise in the model, meaning that each point may shift slightly away from the true lower-dimensional structure. The problem of discovering this structure and representing it in a user-friendly way is a subject of vigorous research. Key challenges include determining how to computationally resolve certain recurring subtasks—such as clustering the points—or identifying each point’s nearest neighbors to form a connectivity graph. Standard algorithms for resolving these tasks scale unfavorably as the dimension \(m\) of the ambient space increases, which naturally leads to a search for embeddings of the points into lower-dimensional spaces that preserve important geometric properties. A classical result in this direction is the Johnson-Lindenstrauss lemma, which states that for any \(\epsilon > 0\) and any dimension \(d\) satisfying

\[d \geq 4\,\left(\frac{\epsilon^{2}}{2} - \frac{\epsilon^{3}}{3}\right)^{-1}\,\log(n),\]

there exists a map \(f\,\colon\,\mathbb{R}^{m} \rightarrow \mathbb{R}^{d}\) that approximately preserves distances in the sense that

\[(1 - \epsilon)\,\|\mathbb{a}_{i} - \mathbb{a}_{j}\|^{2} \leq \|f(\mathbb{a}_{i}) - f(\mathbb{a}_{j})\|^{2} \leq
(1 + \epsilon)\,\|\mathbb{a}_{i} - \mathbb{a}_{j}\|^{2} \]

for every pair \(\{\mathbb{a}_{i},\mathbb{a}_{j}\}\) of points. The map \(f\) is often built via stochastic techniques (e.g., \(f(\mathbf{x})=d^{-1/2}\mathbf{Gx}\), where \(\mathbf{G}\) is again a Gaussian random matrix, now of size \(d \times n\)).

The Johnson-Lindenstrauss lemma is compelling from a theoretical point of view; the reduced dimension \(d\) depends exclusively on the number of points \(n\) (not the dimension \(m\) of the ambient space), and the dependence is only logarithmic with respect to \(n\). But from a practical point of view, the lemma is by itself of limited use — the scaling with respect to the distortion parameter \(\epsilon\) is ghastly. For instance, if \(\epsilon=0.1\) then \(d \gtrapprox 800\,\log(n)\), which is far too high a dimension for most classical algorithms to be practicable and still leaves substantial distortions of 10 percent. While it is certainly possible to improve the original Johnson-Lindenstrauss results, most known randomized dimension reduction techniques either exhibit very sizable distortions or an image space of unacceptably high dimension.

Randomized Nearest Neighbor Search and a Two-stage Sampling Strategy

A delightful algorithm proposed by Peter Jones, Andrei Osipov, and Vladimir Rokhlin [2]—designed to solve the aforementioned approximate nearest neighbor problem—illustrates how to harness the power of randomized projections without badly sacrificing accuracy. Given a small integer \(k\) and a set of points \(\{\mathbb{a}_{j}\}_{j=1}^{n}\) in \(\mathbb{R}^{m}\), where \(m\) is large, let us find each point's \(k\) nearest neighbors. This seems ideally suited for a Johnson-Lindenstrauss projection-type approach. However, as we have seen, if \(d\) is small enough for classical methods, the introduced distortions would produce many mistaken nearest neighbors. While a single instance of the outlined procedure is exceedingly unreliable, the situation improves dramatically with several repetitions of the experiment [2]. Say we draw 10 different random projections and perform a low-dimensional search for each one. Every search yields a list of candidates for any given point’s nearest neighbors. Now we simply form for each point the union of these lists. Finally, we return to the original dataset to compute the actual distances in \(\mathbb{R}^{m}\), keep the \(k\) best results, and discard the rest.

Both the RSVD and the randomized nearest neighbor search manage to find accurate results despite randomized projections’ introduction of substantial distortions. Perhaps the trick is a two-stage recipe:

Use randomized projections to inexpensively develop a rough sketch that tells us where to look
Use high-precision deterministic methods to compute a precise answer.

The second step allows one to aggressively oversample in the first stage, worrying only about minimizing the risk of missing important information. Any unnecessary information is then filtered out in the second stage.

Here I have described two randomized algorithms designed to help process large datasets in high-dimensional spaces. These algorithms are typically more accurate than traditional randomized algorithms in scientific computing such as Monte Carlo. One may perhaps argue that they are more closely related to methods like randomized QuickSort, in that the algorithm’s randomized aspect primarily affects the runtime. Algorithms of this type reliably produce highly accurate answers and are forgiving in terms of random number generation.

References
[1] Halko, N., Martinsson, P.-G., & Tropp, J.A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2), 217-288.
[2] Jones, P.W., Osipov, A., & Rokhlin, V. (2013). A randomized approximate nearest neighbors algorithm. Appl. Comp. Harm. Anal., 34(3), 415-444.

Per-Gunnar Martinsson holds the W.A. “Tex” Moncrief, Jr. Chair in Simulation-Based Engineering and Sciences at the University of Texas at Austin, where he is also a professor of mathematics. He has previously held faculty positions at the University of Oxford, the University of Colorado at Boulder, and Yale University.