
The World’s Most Fundamental Matrix Equation

By Nick Higham

It has been called the world’s most fundamental matrix equation (WMFME), but it is just the associative law \(A(BC) = (AB)C\) for matrix multiplication. This simple equation directly yields many less obvious formulas and has important implications for computation.

Consider the symmetric form of the WMFME in which \(C=A\), so that if \(A\) is \(m \times n\) then \(B\) is \(n\times m\). The WMFME is then equivalent to \((I_m+AB)A = A(I_n+BA)\), which yields

\[(I_m + AB)^{-1} = I_m - A(I_n + BA)^{-1}B\]

(to see the connection, just multiply both sides on the left by \(I_m + AB\)). This formula is sometimes called the matrix inversion lemma, and while it can be proved in many ways, it is really just a manifestation of the WMFME. In fact, it is essentially the well-known Sherman-Morrison-Woodbury formula, in which context we usually think of \(n\) as much smaller than \(m\), so that \(I_m + AB\) is a low-rank update to \(I_m\).
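As a quick numerical illustration (a minimal NumPy sketch; the sizes and random data are purely illustrative), the formula replaces an \(m\times m\) inverse with an \(n\times n\) one when \(n \ll m\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 5           # n much smaller than m, as in the Woodbury setting
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))

# Direct inversion of the m-by-m matrix: O(m^3) work.
direct = np.linalg.inv(np.eye(m) + A @ B)

# Matrix inversion lemma: only an n-by-n inverse is needed.
lemma = np.eye(m) - A @ np.linalg.inv(np.eye(n) + B @ A) @ B

print(np.linalg.norm(direct - lemma) / np.linalg.norm(direct))  # ~1e-14
```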

[Cartoon created by mathematician John de Pillis.]

The WMFME is a mathematical—but not computational—equivalence. Generations of students have learned that the product \(xy^T z\), where \(x\), \(y\), and \(z\) are \(n\)-vectors, should be written and evaluated as \(x(y^Tz)\) (\(O(n)\) flops) rather than \((xy^T)z\) (\(O(n^2)\) flops). More generally, deciding where to put the parentheses in a matrix product \(A_1 A_2\dots A_k\) to minimize the number of operations in the evaluation is a nontrivial problem, known as the matrix chain multiplication problem. Textbooks show that the problem (which does not ask for the matrix product to actually be formed) can be solved by dynamic programming in \(O(k^3)\) operations, but an \(O(k\log k)\) algorithm was devised by T.C. Hu and M.T. Shing [3].
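As a concrete sketch, here is the \(O(k^3)\) dynamic program that textbooks describe (the function name and the dimension list are illustrative):

```python
def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications to evaluate A_1 A_2 ... A_k,
    where A_i is dims[i-1] x dims[i]: the classical O(k^3) dynamic program."""
    k = len(dims) - 1
    cost = [[0] * (k + 1) for _ in range(k + 1)]
    for span in range(2, k + 1):                  # length of the subchain
        for i in range(1, k - span + 2):
            j = i + span - 1
            cost[i][j] = min(cost[i][s] + cost[s + 1][j]
                             + dims[i - 1] * dims[s] * dims[j]
                             for s in range(i, j))
    return cost[1][k]

# x (y^T z) versus (x y^T) z for n-vectors corresponds to dims [n, 1, n, 1].
print(matrix_chain_cost([1000, 1, 1000, 1]))  # 2000, i.e., O(n), not O(n^2)
```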

The WMFME is also relevant in evaluating the Jacobian of a composition of functions. Consider \(g(x) = f_3(f_2(f_1(x)))\), where \(f_k\) maps \(\mathbb{R}^{n_k}\) to \(\mathbb{R}^{n_{k+1}}\). The chain rule allows us to express the Jacobian of \(g\) in terms of the Jacobians of the constituent functions:

\[J_g(x) = J_{f_3}\bigl( f_2(f_1(x)) \bigr)\, J_{f_2}\bigl( f_1(x) \bigr)\, J_{f_1}(x).\]

Depending on the dimensions \(n_1\), \(n_2\), \(n_3\), and \(n_4\), evaluating \(J_g\) as \((J_{f_3} J_{f_2}) J_{f_1}\) or \(J_{f_3} (J_{f_2} J_{f_1})\) may be more efficient. More subtly, the expressions \(\max\left\{\|J_g z\|_2/\|z\|_2: 0 \ne z\in\mathbb{R}^{n_1}\right\}\) and \(\max\left\{\|w^TJ_g\|_2/\|w\|_2: 0 \ne w\in\mathbb{R}^{n_4}\right\}\) both give the \(2\)-norm of \(J_g\). If \(n_1 \ne n_4\) then one of these expressions will be easier to evaluate analytically, as it involves fewer unknowns. This is a form of duality that can be exploited to find explicit expressions for condition numbers [1].
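A small sketch with illustrative dimensions makes both points concrete: the two parenthesizations can have very different costs, and the \(2\)-norms of \(J_g\) and its transpose agree:

```python
import numpy as np

# Illustrative dimensions: J_{f_1} is n2 x n1, J_{f_2} is n3 x n2,
# and J_{f_3} is n4 x n3.
n1, n2, n3, n4 = 1000, 10, 10, 1

# Scalar multiplications for each parenthesization of J3 J2 J1.
left_first  = n4 * n3 * n2 + n4 * n2 * n1     # (J3 J2) J1
right_first = n3 * n2 * n1 + n4 * n3 * n1     # J3 (J2 J1)
print(left_first, right_first)                # 10100 versus 110000

# Duality: the 2-norm of a matrix equals that of its transpose.
J = np.random.default_rng(1).standard_normal((n4, n1))
print(np.isclose(np.linalg.norm(J, 2), np.linalg.norm(J.T, 2)))  # True
```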

Algorithmic differentiation (also called automatic differentiation, or AD) is a well-established subject that has attracted renewed interest because of its use in backpropagation for neural networks and deep learning. The difference between the forward and reverse modes of AD is essentially the order in which the Jacobians are accumulated, much as in the previous paragraph’s discussion. I originally learned this interpretation from Mike Giles; it can also be found in the book [2] by Andreas Griewank and Andrea Walther. Gil Strang includes this timely topic in his forthcoming book [5] as well.
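The following toy sketch uses fixed random matrices as stand-ins for the local Jacobians (in real AD they are evaluated at the intermediate points of the computation). Forward mode builds the Jacobian from Jacobian-vector products, one per input; reverse mode uses vector-Jacobian products, one per output, which is far cheaper when the output dimension is small:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, n3, n4 = 1000, 50, 50, 1        # many inputs, one output
J1 = rng.standard_normal((n2, n1))      # stand-ins for the local Jacobians
J2 = rng.standard_normal((n3, n2))
J3 = rng.standard_normal((n4, n3))

# Forward mode: J3 (J2 (J1 v)), accumulated from the right.
def jvp(v):
    return J3 @ (J2 @ (J1 @ v))

# Reverse mode: ((w^T J3) J2) J1, accumulated from the left.
def vjp(w):
    return ((w @ J3) @ J2) @ J1

# Full Jacobian: n1 = 1000 JVPs in forward mode, but only n4 = 1 VJP in reverse.
Jg_forward = np.column_stack([jvp(v) for v in np.eye(n1)])
Jg_reverse = np.vstack([vjp(w) for w in np.eye(n4)])
print(np.allclose(Jg_forward, Jg_reverse))  # True
```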

A rather different implication of the WMFME is the equation \(A(BA)^{1/2} = (AB)^{1/2}A\), for any \(A\) and \(B\) such that the square roots exist. One can derive the equation by iterating the symmetric form of the WMFME to show that \(A(BA)^k = (AB)^kA\) for all positive integers \(k\), concluding that \(Ap(BA) = p(AB)A\) for any polynomial \(p\), and then using the definition of a matrix function via an interpolating polynomial.
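A quick numerical check (with symmetric positive definite \(A\) and \(B\), chosen so that \(AB\) and \(BA\) have positive eigenvalues and the principal square roots certainly exist):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
n = 5
X = rng.standard_normal((n, n)); A = X @ X.T + n * np.eye(n)  # SPD
Y = rng.standard_normal((n, n)); B = Y @ Y.T + n * np.eye(n)  # SPD

lhs = A @ sqrtm(B @ A)      # A (BA)^{1/2}
rhs = sqrtm(A @ B) @ A      # (AB)^{1/2} A
print(np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs))  # ~1e-15
```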

A closer look at the matrix inversion lemma leads to the realization that the conditions for the existence of the inverses of \(I_m + AB\) and \(I_n + BA\) must be the same. How can this be, given that \(AB \ne BA\) in general and that the two matrices can be of different sizes? In fact, \(AB\) and \(BA\) have quite a lot in common. They have the same trace, and the same determinant when \(m=n\). Indeed, they have the same nonzero eigenvalues. More precisely, the sets of eigenvalues of \(AB\) and \(BA\) are the same except when \(m\ne n\), in which case the larger matrix has additional zero eigenvalues. Therefore, \(I_m + AB\) is nonsingular precisely when \(I_n + BA\) is.
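This is easy to observe numerically; in the following illustrative sketch, the \(m\times m\) product \(AB\) has the \(n\) eigenvalues of the \(n\times n\) product \(BA\) plus \(m-n\) zero eigenvalues, up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))

print(np.sort_complex(np.linalg.eigvals(A @ B)))  # 6 eigenvalues, 3 near zero
print(np.sort_complex(np.linalg.eigvals(B @ A)))  # the 3 nonzero eigenvalues
```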

Finally, we note that if \(A\) and \(B\) are symmetric positive definite, then \(ABA\) is symmetric positive definite, as it is congruent to \(B\). Certain matrix trace considerations in quantum physics lead to the question of whether an arbitrary product formed from multiple copies of \(A\) and \(B\) (such as \(A^3BA\) or \(BAB^2AB\)) has positive eigenvalues (the product is in general not symmetric). The answer is no, but examples with a negative eigenvalue are hard to find [4].
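A short experiment is consistent with this (the random SPD matrices are illustrative; a genuine counterexample requires a carefully chosen word and letters [4]):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n)); A = X @ X.T + np.eye(n)  # SPD
Y = rng.standard_normal((n, n)); B = Y @ Y.T + np.eye(n)  # SPD

# ABA is symmetric and congruent to B, hence positive definite.
print(np.linalg.eigvalsh(A @ B @ A).min() > 0)            # True

# A nonsymmetric word such as BAB^2AB: random trials give positive
# eigenvalues, in line with counterexamples being hard to find [4].
W = B @ A @ B @ B @ A @ B
print(np.linalg.eigvals(W).real.min() > 0)                # True here
```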

References
[1] Baboulin, M., & Gratton, S. (2009). Using dual techniques to derive componentwise and mixed condition numbers for a linear function of a linear least squares solution. BIT Numer. Math., 49(1), 3-19.
[2] Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Philadelphia, PA: Society for Industrial and Applied Mathematics.
[3] Hu, T.C., & Shing, M.T. (1984). Computation of matrix chain products. Part II. SIAM J. Comput., 13(2), 228-251.
[4] Johnson, C., & Hillar, C. (2002). Eigenvalues of words in two positive definite letters. SIAM J. Matrix Anal. Appl., 23, 916-928.
[5] Strang, G. (2018). Linear Algebra and Learning from Data. Wellesley, MA: Wellesley-Cambridge Press. To appear, http://math.mit.edu/~gs/learningfromdata/.

Nicholas Higham is the Richardson Professor of Applied Mathematics at The University of Manchester. He is the current president of SIAM.