# Unifying Different Perspectives: From Cubism to Convolution

Visual perception, a fundamental sensation that shapes our understanding of the world, has long been of interest to both science and art. Neural activities related to the various stages of visual perception are associated with different areas of the visual cortex. Objects are first “observed” by cells tuned to elementary stimuli. In subsequent stages, specific regions of the brain that handle more complex structures are activated depending on what one is looking at; faces and Chinese characters are very different! Neuroscientists still do not fully understand how the elements detected in the early stages are integrated to form the concept of an object. At the other end of the spectrum, artists experiment with the same subject. Cubism is an avant-garde art movement that explores the relation between concept formation and perception. Cubist paintings usually depict objects in parts, from multiple viewpoints simultaneously (see Figure 1). It is difficult (but not impossible) for spectators to “picture” the objects in these paintings by unifying the visually observed pieces in their minds, like solving a virtual puzzle.

**Figure 1.**

*Les Demoiselles d’Avignon*, a painting by Pablo Picasso, embodies some of the bold traits of Cubism. Public domain image.

Many image models fall into one of two types of representation, local and nonlocal, which offer intrinsically different viewpoints. Local image representations focus on characterizing the local features present in images. Wavelet decomposition, in which wavelets play the role of the elementary stimuli in our visual cortex, is perhaps the most classical local image representation. It is widely observed that, given a wavelet basis, an image can often be well approximated by only a few basis elements (wavelets). Furthermore, the wavelets are locally supported and the family is shift- and scale-invariant in the image domain, meaning that its members are copies generated by scaling and shifting a wavefront pattern known as the mother wavelet. Therefore, the pattern of the mother wavelet can effectively capture local features of images, with the choice of mother wavelet determining the class of images being modeled. Dictionary learning is a more flexible local image representation. Rather than prescribing designed local patterns such as the mother wavelets in wavelet bases, it learns a dictionary of representative patterns from fixed-size patches extracted from an image or a collection of images (a training dataset). These adaptive patch patterns are then used to decompose, more efficiently, images whose patches are similar to those in the training dataset.
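To make the sparsity claim about wavelet bases concrete, here is a minimal sketch in Python. It assumes a one-dimensional signal and the orthonormal Haar wavelet, neither of which is prescribed by the text; the helper names `haar_forward` and `haar_inverse` are hypothetical and written only for this illustration.

```python
import numpy as np

def haar_forward(x):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns the coarsest approximation coefficient and a list of detail
    coefficient arrays, ordered from the finest scale to the coarsest.
    """
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences
        approx = (even + odd) / np.sqrt(2.0)         # local averages
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward by undoing one scale at a time, coarsest first."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

# Assumed toy signal: piecewise constant, length 64, with two jumps.
signal = np.concatenate([np.full(16, 1.0), np.full(24, 3.0), np.full(24, -2.0)])

approx, details = haar_forward(signal)
coeffs = np.concatenate([approx] + details)

# Count the coefficients that are not numerically zero.
num_significant = int(np.sum(np.abs(coeffs) > 1e-12))
reconstruction = haar_inverse(approx, details)

print(f"{num_significant} of {coeffs.size} Haar coefficients are nonzero")
print("max reconstruction error:", np.max(np.abs(reconstruction - signal)))
```

For this toy signal only a handful of coefficients survive, because a Haar detail coefficient vanishes wherever the signal is constant over the wavelet's support. Natural images are not exactly piecewise constant, so in practice one would expect approximate rather than exact sparsity.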

A nonlocal image representation, on the other hand, focuses on the repetition rather than the decomposition of patterns in an image. Such a model is “nonlocal” because similar patches from one image are not necessarily close to each other in the image domain. One popular nonlocal image model comes from manifold learning, where image patches are assumed to vary smoothly in the patch space, forming a manifold with few degrees of freedom (the dimension of the manifold) compared to the patch size (the dimension of the ambient patch space). In practice, the patch manifold is approximated by a graph whose nodes are sample patches. Furthermore, one can define a diffusion process on the patches with respect to their similarity; the corresponding (graph) Laplacian induces an orthonormal spectral basis that encodes the connection between similar patches.
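The graph construction described above can be sketched as follows. This is only an illustration under several assumptions that the text does not fix: a one-dimensional signal, patches that start at (rather than center on) each sample, Gaussian similarity weights with a median-based bandwidth, and the unnormalized Laplacian \(L = D - W\). The function names are hypothetical.

```python
import numpy as np

def extract_patches(signal, patch_len):
    """Return the patch starting at every sample, using periodic extension."""
    n = signal.size
    idx = (np.arange(n)[:, None] + np.arange(patch_len)[None, :]) % n
    return signal[idx]  # shape (n, patch_len)

def patch_graph_spectral_basis(signal, patch_len):
    """Eigen-decomposition of an (assumed) unnormalized patch-graph Laplacian."""
    patches = extract_patches(signal, patch_len)

    # Pairwise squared Euclidean distances between patches.
    sq_norms = np.sum(patches**2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * patches @ patches.T
    dist2 = np.maximum(dist2, 0.0)

    # Gaussian similarity; the median bandwidth is an illustrative choice.
    bandwidth2 = np.median(dist2[dist2 > 0])
    weights = np.exp(-dist2 / bandwidth2)
    weights = 0.5 * (weights + weights.T)  # enforce exact symmetry
    np.fill_diagonal(weights, 0.0)

    laplacian = np.diag(weights.sum(axis=1)) - weights

    # eigh returns ascending eigenvalues and orthonormal eigenvectors (columns).
    eigenvalues, basis = np.linalg.eigh(laplacian)
    return eigenvalues, basis

# Assumed toy data: a noisy sinusoid of length 64, patches of length 8.
rng = np.random.default_rng(0)
n, patch_len = 64, 8
signal = np.sin(2 * np.pi * np.arange(n) / 16) + 0.1 * rng.standard_normal(n)

eigenvalues, basis = patch_graph_spectral_basis(signal, patch_len)
print("smallest eigenvalue:", eigenvalues[0])
print("basis is orthonormal:", np.allclose(basis.T @ basis, np.eye(n)))
```

Because the Laplacian is symmetric, its eigenvectors form an orthonormal basis of \(\mathbb{R}^n\), with one basis entry per patch. The eigenvector for the smallest eigenvalue is constant; the following ones vary slowly across patches that the graph regards as similar.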

Considering local wavelet decomposition and nonlocal manifold learning (spectral decomposition) as specific examples, we demonstrate a novel way to combine local and nonlocal image representations by convolution. Given a discrete image \(f \in \mathbb{R}^n\), its decomposition with respect to a \(J\)-level (overcomplete) wavelet basis generated by the mother wavelet \(\psi\) is

\[f = \sum_{i,j} a_{i,j}\, \psi (2^j (\cdot - i)), \qquad i = 1, \ldots, n, \enspace j = 1, \ldots, J. \tag{1}\]

Because the wavelet transform is shift-invariant, we can rewrite the above decomposition as a sum of convolutions \(\sum_j A_j \ast \psi (2^j \cdot)\), where \(A_j \in \mathbb{R}^n\) is the vector of wavelet coefficients \(a_{\cdot,j}\) associated with the scaled mother wavelet \(\psi(2^j \cdot)\) over all translations in the image domain. Alternatively, if we look at patches of size \(2^J \times 2^J\) centered at each pixel in the image (with periodic boundary extension), we see that they are all decomposed against the same set of basic patterns, i.e., the mother wavelet at different scales \(\psi(2^j \cdot)\). Therefore, two similar patches \(p_s\) and \(p_t\), centered at \(s\) and \(t\) respectively, have coefficients \(A_j(s)\) and \(A_j(t)\) that are close for \(j = 1, \ldots, J\). In other words, the coefficient vectors \(A_j\) reflect the similarity between patches. On the other hand, if we construct a graph using all patches \(p_i\), then the spectral basis \(\phi_k\), \(k = 1, \ldots, n\), generated from the graph Laplacian in manifold learning is an orthonormal basis of \(\mathbb{R}^n\). We can therefore use the spectral basis to decompose the coefficient vectors as \(A_j = \sum_k c_{j,k} \, \phi_k\), which reformulates the original image decomposition \((1)\) as a linear combination of convolution components generated from the wavelet basis and the spectral basis:

\[f = \sum_{j,k} c_{j,k}\, \phi_k \ast \psi (2^j \cdot). \tag{2}\]

In fact, Proposition 1 in [1] shows that given any orthonormal basis \(\psi_j\) in \(\mathbb{R}^\ell\) and any orthonormal basis \(\phi_k\) in \(\mathbb{R}^n\), the bases generate a tight frame of \(\mathbb{R}^n\) consisting of convolution components \(v_{j,k} := \psi_j \ast \phi_k\) with the frame constant \(\sqrt{\ell}\); \(v_{j,k}\) are called convolution framelets.
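The tight-frame property can be checked numerically on a small example. The sketch below is not the construction from [1] verbatim: it assumes one-dimensional signals, interprets \(\ast\) as circular convolution with each \(\psi_j\) zero-padded to length \(n\), and draws both orthonormal bases at random. The names `circ_conv` and `convolution_framelets` are hypothetical.

```python
import numpy as np

def circ_conv(a, b):
    """Circular convolution of two real vectors of equal length, via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def convolution_framelets(local_basis, nonlocal_basis):
    """Stack all v_{j,k} = psi_j * phi_k as rows of an (ell * n, n) array.

    local_basis:    (ell, ell) matrix whose columns psi_j are orthonormal.
    nonlocal_basis: (n, n) matrix whose columns phi_k are orthonormal.
    """
    ell = local_basis.shape[0]
    n = nonlocal_basis.shape[0]
    frames = np.empty((ell, n, n))
    for j in range(ell):
        psi = np.zeros(n)
        psi[:ell] = local_basis[:, j]  # zero-pad the local pattern
        for k in range(n):
            frames[j, k] = circ_conv(psi, nonlocal_basis[:, k])
    return frames.reshape(ell * n, n)

rng = np.random.default_rng(0)
n, ell = 16, 4

# Random orthonormal bases (illustrative stand-ins for wavelet/spectral bases).
local_basis, _ = np.linalg.qr(rng.standard_normal((ell, ell)))
nonlocal_basis, _ = np.linalg.qr(rng.standard_normal((n, n)))

frames = convolution_framelets(local_basis, nonlocal_basis)
f = rng.standard_normal(n)

coeffs = frames @ f  # inner products <f, v_{j,k}>
energy_ratio = np.sum(coeffs**2) / np.sum(f**2)
f_rec = frames.T @ coeffs / ell  # tight-frame reconstruction

print("sum |<f, v>|^2 / ||f||^2 =", energy_ratio)
print("max reconstruction error:", np.max(np.abs(f_rec - f)))
```

Under these conventions the sum of squared coefficients equals \(\ell\) times the signal energy, which appears consistent with the frame constant \(\sqrt{\ell}\) quoted above, and dividing the synthesis by \(\ell\) recovers \(f\) exactly. The factor \(\ell\) arises because every sample of \(f\) belongs to \(\ell\) overlapping patches.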

Combining a local and a nonlocal basis results in convolution framelets with stronger representation power than either basis alone. To see this, consider a simulated image \(f\) containing two patterns, \(\psi_1\) and \(\psi_2\), whose supports divide the image domain into \(D_1\) and \(D_2\). In this case, the leading (nontrivial) spectral basis vector is \(\phi_1=1_{D_1}-1_{D_2}\) (up to a constant) and the image is thus a linear combination of four convolution framelets, \(f=0.5\psi_1\ast(\phi_1+\phi_0)+0.5\psi_2\ast(\phi_0-\phi_1)\), where \(\phi_0=1\) is the trivial spectral basis vector (up to a constant). A pair of local and nonlocal bases can also be viewed in the form of an autoencoder [1], a type of neural network trained to reproduce its input at its output while passing it through a reduced-dimension representation. We also find that regularization induced by convolution framelets improves reconstruction results compared with regularization on the corresponding nonlocal basis alone.

In general, one can obtain a convolution component \(v=\psi \ast \phi\) by distributing the pattern \(\psi\) in the image domain according to the layout \(\phi\); \(v\) inherits its regularity from both \(\psi\) and \(\phi\). Imagine an artist working step by step on a painting. Each time the artist paints part of the painting with a certain type of brush stroke to create a specific pattern, they add a “convolution component” to the painting. The set of patterns to choose from depends on the painting’s style and the artist’s skill set, whereas the layout of the patterns is more closely related to the painting’s content. As with art, there are many ways to represent an image, yet none is optimal. The convolution framelets introduced here draw their inspiration from classical models and remain to be explored further in the future.

**References**

[1] Yin, R., Gao, T., Lu, Y.M., & Daubechies, I. (2017). A tale of two bases: Local-nonlocal regularization on image patches with convolution framelets. *SIAM Journal on Imaging Sciences, 10*(2), 711-750.