# Learning with Nonnegative Matrix Factorizations

Identifying the underlying structures within large data sets is a crucial data science task, and numerous techniques exist to successfully do so. One of the oldest approaches is linear dimensionality reduction (LDR), which assumes that each data point is generated from a linear combination of a small number of basis elements. Given a data set of \(n\) vectors \(x_i \in \mathbb{R}^p (1 \leq i \leq n)\), LDR looks for a small number \(r\) of basis vectors \(u_k \in \mathbb{R}^p (1 \leq k \leq r \ll n)\), such that each data point is well approximated using a linear combination of these basis vectors; that is,

\[x_i \approx \sum_{k=1}^r u_k v_{i}(k) \; \text {for all} \enspace i,\]

where the \(v_{i}(k)\)s are scalars. LDR is equivalent to low-rank matrix approximation (LRA) since one can equivalently write the previous equation as

\[\underbrace{[x_1, x_2, \dots, x_n]}_{X \in \mathbb{R}^{p \times n}} \approx

\underbrace{[u_1, u_2, \dots, u_r]}_{U \in \mathbb{R}^{p \times r}}

\underbrace{[v_1, v_2, \dots, v_n]}_{V \in \mathbb{R}^{r \times n}}, \]

where \(v_i \in \mathbb{R}^r\) is the \(i\)th column of \(V\) and contains the weights to approximate \(x_i\) in the basis \(U\); i.e., \(x_i \approx U v_i\) for all \(i\). The rank-\(r\) matrix \(UV\) hence approximates the data matrix \(X\).

When the solution \((U,V)\) minimizes the sum of the squares of the residuals \(X-UV\) (least squares), LRA recovers principal component analysis (PCA), which can be computed via the singular value decomposition. LRA is a workhorse in numerical linear algebra and relevant to a wide array of applied mathematics, including blind source separation in signal processing; regression, prediction, clustering, and noise filtering in statistics and data analysis; and system identification and model reduction in control and systems theory. Although PCA has existed for more than a century, LRA has gained much momentum in the last 20 years. The reason for this is mostly twofold: (i) data analysis has become increasingly important in recent years, particularly with the current big data era, and (ii) LRA—despite its simplicity—is very powerful since low-rank matrices can effectively approximate many high-dimensional data sets [3].

Researchers use LRA models as either compression tools or as direct means to identify hidden structures in data sets. Many variants of LRA have emerged over the past few years, with two key differences. First, the error measure can vary and should thus be chosen depending on the noise statistic assumed on the data. For example, PCA employs least squares, which implicitly assumes independent and identically distributed Gaussian noise. Using the \(\ell_1\) norm leads to robust PCA, which better tolerates outliers. Another significant variant occurs when data is missing, for which the error measure can only be based on the observed entries. Researchers have successfully utilized it for recommender systems that predict users’ preferences for certain items. This was true of the Netflix Prize, a competition that sought the most efficient filtering algorithm to predict user film ratings based solely on previous ratings.

Second, one can impose varied constraints on the factors \(U\) and \(V\) to achieve different goals. In PCA, the factors are orthogonal and ordered in terms of importance. If a user instead wants to explain each data point with the minimum number of basis vectors, every column of matrix \(V\) should contain as many zero entries as possible. This LRA variant is called sparse component analysis and is closely related to dictionary learning and sparse PCA.

Among LRA models, nonnegative matrix factorization (NMF) requires factor matrices \(U\) and \(V\) to be component-wise nonnegative. Pentti Paatero and Unto Tapper first introduced NMF in 1994; it then gained momentum with the seminal paper of Daniel Lee and Sebastian Seung in 1999 [2]. Because of the additional nonnegativity constraints, NMF has a higher fitting error than PCA. Therefore, one should employ it when the factors \(U\) and \(V\) allow for the identification of hidden structures in a data set, e.g., when their entries are physical quantities.

### Blind Hyperspectral Unmixing

A hyperspectral image (HSI) records the spectral signature of each pixel by measuring the reflectance (fraction of reflected, nonnegative light energy) for up to 200 different wavelengths. Physical materials reflect different amounts of light at different wavelengths, and are hence identifiable by their spectral signatures. Blind hyperspectral unmixing (blind HU) aims to recover the materials present in an HSI—called the endmembers—along with the abundances of every endmember in each pixel (which are also nonnegative) without prior knowledge of the materials or their properties. The linear mixing model (LMM) is the most standard model for blind HU. It assumes that each pixel’s spectral signature is a linear combination of the endmembers’ spectral signatures, where the weights are the abundances of the endmembers in that pixel. For example, if a pixel contains 50 percent grass and 50 percent road surface, its spectral signature will be equal to half of the spectral signature of grass plus half of the spectral signature of the road surface. This is because half of the light hitting that pixel is reflected by the grass and the other half is reflected by the road. When constructing matrix \(X\) so that each column is a pixel’s spectral signature, the LMM translates into NMF — where the columns of \(U\) contain the endmembers’ spectral signatures and the columns of \(V\) contain the abundances of these endmembers in each pixel. Figure 1 illustrates this decomposition.

**Figure 1.**Blind hyperspectral unmixing of an urban image taken above a Walmart in Copperas Cove, Texas, using nonnegative matrix factorization, with

*r*=6 (162 spectral bands, 307x307 pixels). Each factor corresponds to the spectral signatures of an endmember (a column of

*U*) with its abundance map (a row of

*V*). Light tones represent high abundances. Image courtesy of Nicolas Gillis.

### Audio Source Separation

Given an audio signal recorded from a single microphone, one can construct a magnitude spectrogram. The signal is split into small time frames with some overlap (usually 50 percent). A user applies the short-time Fourier transform on each time frame and obtains the corresponding column of \(X\) by taking the magnitude of the Fourier coefficients. A piano recording of “Mary Had a Little Lamb,” whose musical score is shown below, is a very simple monophonic signal for illustrative purposes.

The sequence is composed of three notes: \(C_4\), \(D_4\), and \(E_4\). Figure 2 depicts the NMF decomposition of the magnitude spectrogram using \(r=4\). The three notes are extracted as the first three columns of \(U\) (the signature of each note in the frequency domain), and a fourth “note” (last column of \(U\)) captures the first offset of each note in the musical sequence (common mechanical vibration acting in the piano just before triggering a note). The rows of \(V\) provide the activation of each note in the time domain. NMF is consequently able to blindly separate the different sources and identify which source is active at a given moment in time. Note that NMF reaches its full potential in polyphonic music analysis, when several notes and even several instruments are played simultaneously.

**Figure 2.**Decomposition of the piano recording “Mary Had a Little Lamb” using nonnegative matrix factorization.

**2a.**Amplitude spectrogram

*X*in decibels.

**2b.**Basis matrix

*U*corresponding to the notes

*C*

_{4},

*D*

_{4}, and

*E*

_{4}.

**2c.**An offset activation matrix

*V*that indicates when each note is active. Image courtesy of Nicolas Gillis.

Other applications of NMF include extracting parts of faces from sets of facial images, identifying topics in a collection of documents, learning hidden Markov models, detecting communities in large networks, analysing medical images, and decomposing DNA microarrays [1]. In these instances, NMF’s power stems from the interpretability of factors \(U\) and \(V\). It would not be possible to interpret the PCA factors for the two aforementioned applications as done with NMF (e.g., as physical quantities that can only take nonnegative values). Due to the nonnegativity constraints, NMF’s \(U\) and \(V\) factors automatically have some degree of sparsity.

Key considerations of NMF usage inspire important research questions for both NMF and other LRA models. A first critical question is as follows: *How does one ensure that the recovered factors correspond to the true factors that generated the data?* For instance, how can one be certain that matrix \(U\) will correspond to the true endmembers in blind HU or the true sources in audio source separation? In fact, NMF decompositions are in general not unique and thus require additional assumptions (such as sparsity of \(U\) and/or \(V\)) to guarantee that the computed decomposition matches that which generated the data [1]. A second vital issue concerns the *practical computation of NMFs*. NMF is NP-hard. In practice, most NMF algorithms are heuristics based on alternatively updating \(U\) and \(V\). Yet in some cases, one can provably solve NMF in an efficient manner. For example, this is possible in blind HU in the presence of pixels containing only a single endmember. Other areas of ongoing research include model-order selection (choice of \(r\)) and choice of the objective function.

Despite the surge in interest over the past 20 years, the intricacies of solutions to this straightforward but subtly difficult problem mean that an abundance of elegant theoretical opportunities are still available for NMF. Given the breadth and practicality of its applications, NMF is clearly gaining momentum as a fundamental tool for understanding latent structure in data.

**Acknowledgments:** The author acknowledges the Fonds de la Recherche Scientifique-FNRS and the Fonds Wetenschappelijk Onderzoek-Vlaanderen under EOS project no. O005318F-RG47, as well as the European Research Council (ERC Starting Grant no. 679515). He is thankful for the feedback of his colleagues—particularly Tim Marrinan, Jérémy Cohen, and Valeria Simoncini—on this article.

**References**

[1] Fu, X., Huang, K., Sidiropoulos, N.D., & Ma, W.K. (2019). Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. *IEEE Sig. Process. Mag., 36*(2), 59-80.

[2] Lee, D.D., & Seung, H.S. (1999). Learning the parts of objects by non-negative matrix factorization. *Nature, 401*, 788-791.

[3] Udell, M., & Townsend, A. (2019). Why are big data matrices approximately low rank? *SIAM J. Math. Data Sci., 1*(1), 144-160.