SIAM News Blog
SIAM News

# Training Quantized Deep Neural Networks and Applications with Blended Coarse Gradient Descent

By Jack Xin

In recent years, deep neural networks (DNNs) have seen enormous success in big data-driven applications such as image and speech classification, natural language processing, and health sciences [5, 11]. However, DNNs typically require hundreds of megabytes of memory and billions of flops per inference, making them challenging to deploy in resource-limited environments. Researchers have lately been devoting considerable efforts to the development of low-precision networks for memory and computational savings that nearly maintain full-precision network performance. Quantization—an efficient means of producing low-precision DNNs—gives rise to interesting mathematical issues.

DNNs act on an input vector $$v$$ by repeated composition, written as $$F(v) = A_{L}(F_{L-1}(\cdots F_2(F_1(v))))+\, b_{L}$$, where $$F_l(\cdot)=\max(A_l \cdot + \, b_l,0)$$ [7]; the nonlinearity is the so-called ReLU function, $$A_l$$ represents the weight matrices, and $$b_l$$ refers to the bias vectors. For image and sound data, $$A_l$$ filters and extracts multiscale features from low to high levels across hidden layers $$(l\leq L-1)$$. The output $$F(v)$$ is normalized into class probabilities for a classification task [11]. These DNNs are termed convolutional neural networks (CNNs) because $$A_l$$ performs convolutions [5, 11]. Given labeled samples $$(v^{(n)},u^{(n)})$$, network training minimizes the empirical risk: $$N^{-1} \sum_{n=1}^{N} \, \ell\, ( F(v^{(n)}); u^{(n)})$$. Here $$\ell (\cdot\, ;u)$$ is a smooth function that measures the discrepancy between the network-predicted label from input $$v^{(n)}$$ and the true label $$u^{(n)}$$. Use of stochastic gradient descent (SGD)—a variant of the gradient descent method based on small batches of a large training data set—achieves the minimization [5, 11]. For simplicity, we ignore the forthcoming bias vectors.

Quantization refers to replacement of the ReLU function $$\max(\cdot,0)$$ with a piecewise constant function and restriction of weight values to a discrete set. In activation quantization, the quantized network function is $$F_Q (v) := A_L\, \sigma(A_{L-1}\, \cdots A_2\, \sigma(A_1 \, v, \alpha_1), \cdots, \alpha_{L-1} )$$. The $$l$$th quantized ReLU $$\sigma(x_l,\alpha_l)$$ acts element-wise on vector $$x_l$$ from a previous layer and is parameterized by trainable scalar $$\alpha_l>0$$.  In uniform quantization,

$\label{eq:qrelu} \sigma (x,\alpha ) = \begin{cases} 0, \quad & \mathrm{if}\quad x \leq 0,\\ k\, \alpha, \quad & \mathrm{if}\quad \left(k-1\right)\alpha < x \leq k\, \alpha, \; k = 1, 2, \dots, 2^{b_a}-1,\\ \left(2^{b_a}-1\right)\alpha,\quad & \mathrm{if}\quad x > \left(2^{b_a}-1\right)\alpha, \tag1 \end{cases}$

where $$x$$ is the scalar input, $$b_a$$ is the bit-width, and $$k$$ is the quantization level. For a 4-bit quantization, $$b_a = 4$$ and $$2^{b_a} = 16$$ levels exist, including zero.

The empirical risk minimization problem is non-convex, piecewise constant, and high dimensional for $$\sigma$$ in $$(1)$$. In SGD, the gradients of empirical risk function in $$A_l( l \leq L-1)$$ and $$\alpha_l(l\leq L-2)$$ are almost always zero, or simply zero under auto-differentiation for a deep learning platform; this presents an immediate difficulty. The SGD is stuck with vanishing gradients, though descent directions may still exist. A familiar landscape of this kind is a piecewise-constant spiral staircase that contains a descent direction along its envelope. This large-scale view motivates a descent direction via the notion of coarse gradient [8]. The modified chain rule yields the course gradient wherein the partial derivative $$\sigma_x$$ is replaced by $$\tilde \sigma_x$$, with $$\tilde \sigma$$ as the ramp function or clipped ReLU:

$\label{crelu} \tilde{\sigma}(x,\alpha) = \begin{cases} 0, \quad & \mathrm{if}\quad x \leq 0,\\ x, \quad & \mathrm{if}\quad 0 < x \leq (2^{b_a}-1)\alpha, \\ \left(2^{b_a}-1\right)\alpha,\quad & \mathrm{if}\quad x > \left(2^{b_a}-1\right)\alpha. \tag2 \end{cases}$

The partial derivative $$\sigma_\alpha$$ is piecewise constant with $$2^{b_a}$$ values. Numerical experiments on ImageNet [8] have confirmed the success of coarse gradient descent and revealed that an averaged version of $$\sigma_\alpha$$ with $$3$$ or $$2$$ values is a better choice. The $$3$$-valued $$\sigma_\alpha$$ is

\begin{align}\label{3va} \dfrac{\partial \sigma}{\partial \alpha}(x,\alpha) \approx \begin{cases} 0, \quad & \mathrm{if}\quad x \leq 0,\\ 2^{(b_a-1)}, \quad & \mathrm{if}\quad 0 < x \leq \left(2^{b_a}-1\right)\alpha,\\ 2^{b_a}-1,\quad & \mathrm{if}\quad x > \left(2^{b_a}-1\right)\alpha. \tag3 \end{cases} \end{align}

The middle value $$2^{b_a-1}$$ in $$(3)$$ is the arithmetic mean of the intermediate $$k$$ values in $$(1)$$. The $$2$$-valued $$\sigma_\alpha$$ is equivalent to replacing $$\sigma_\alpha$$ with $$\tilde \sigma_\alpha$$, or zeroing out the middle value of $$(3)$$.

Figure 1. ImageNet validation accuracies versus number of epochs for 1W4A quantization on ResNet-18 with (orange) and without (blue) blending for 3-valued σα. Image courtesy of [8].

The $$b_w$$-bit weight quantization restricts entries of $$A_l$$ (total $$M$$ in all layers) to the set $$\mathbb{Q} = \mathbb{R}_+\times\{\pm 1\}^{M}$$ for $$b_w =1$$ and $$\mathbb{Q} = \mathbb{R}_+\times\left\{0,\pm 1, \dots, \pm(2^{b_w-1}-1)\right\}^{M}$$ for $$b_w\geq2$$. Training of a fully-quantized network seeks $$A_l$$s and $$\alpha_l$$s to minimize the empirical risk, subject to the $$\mathbb{Q}$$ constraint. The classical approach is projected gradient descent (PGD): $$\bf{w}$$$$^{k+1} = {\rm proj}_{\mathbb{Q}} (\bf{w}$$$$^k - \eta \, \tilde \nabla$$$$_{\bf{w}}$$$$f$$$$(\bf{w}$$$$^k))$$, where $$\eta > 0$$ is small, $$f$$ is the objective function, and $$\tilde \nabla$$$$_{\bf{w}}$$ is a proper (coarse) gradient in weight $$\bf{w}$$. The $${\rm proj}_{\mathbb{Q}}(\cdot)$$ has affordable closed-form solutions at $$b_w=1,2$$ and approximate solutions at $$b_w\geq 3$$ [10]. However, a small variation of $$\bf{w}$$$$^k$$ may vanish under $${\rm proj}_{\mathbb{Q}}$$ due to $$\mathbb{Q}$$'s discreteness; this can cause stagnation. The BinaryConnect (BC) method [2] provides a cure with a modified update: $$\bf{w}$$$$^{k+1}_{f} =$$ $$\bf{w}$$$$^{k}_{f} - \eta\, \tilde \nabla$$$$_{\bf{w}}$$$$f$$$$(\bf{w}$$$$^k),\;$$ $$\bf{w}$$$$^{k+1} = {\rm proj}_{\mathbb{Q}} \,$$ $$(\bf{w}$$$$^{k+1}_{f})$$, where the full-precision weight $$\bf{w}$$$$^k_{f}$$ continues to evolve and eventually moves the quantized weight $$\bf{w}$$$$^k \in \mathbb{Q}$$. However, the sufficient descent inequality might not hold for BC even when $$f$$ has a Lipschitz gradient, i.e., $$f($$$$\bf{w}$$$$^{k+1}) - f$$$$(\bf{w}$$$$^k) \leq - c \, \|$$$$\bf{w}$$$$^{k+1} -$$$$\bf{w}$$$$^{k}\|^2$$, for constant $$c>0$$ and small learning rate $$\eta$$. Without sufficient descent, $$\{\bf{w}$$$$^k\}$$ may not converge to a critical point even if $$\{f(\bf{w}^\textit{k})\}$$ does. Blending PGD and BC regains the sufficient descent inequality and generates the following blended coarse gradient decent (BCGD) [8]:

$$\bf{w}$$$$_f^{k+1} = (1-\rho)\, \bf{w}$$$$_f^k + \rho \, \bf{w}$$ $$^k - \eta \, \tilde \nabla_{\bf{w}}$$ $$f(\bf{w}$$$$^k),$$  $$\bf{w}$$$$^{k+1} = {\rm proj}_{\mathbb{Q}} (\bf{w}$$$$_f^{k+1})$$

$(4)$

with a blending parameter $$0<\rho\ll 1$$. At $$\rho=10^{-5}$$, Figure 1 depicts validation accuracies of ResNet-18 on ImageNet at $$(b_w, b_a)=(1,4)$$ or 1W4A versus training epochs with and without blending via a $$3$$-valued $$\sigma_\alpha$$. The accuracies noticeably improve with blending. Table 1 lists top-$$1$$ and top-$$5$$ validation accuracies of full-precision ResNet-18 (32W32A) versus low-precision models trained by BCGD with $$3$$- and $$2$$-valued $$\sigma_\alpha$$. The $$3$$-valued $$\sigma_\alpha$$ performs better at a lower precision. At 4W8A, either choice is within one percent of the full-precision model.

Table 1. ImageNet validation accuracies (in percentage) with blended coarse gradient descent on ResNet-18 for 3- and 2-valued σα. Table courtesy of [8].

Coarse gradients are non-unique. Use of derivatives of a proxy activation function (straight-through estimator) in gradient-based training dates back to the 1950s and the learning of single-layer networks. Researchers have proposed various estimators for multilayer networks [1, 3, 4]; depending on the problem, some perform better than others [9]. Consider a neural network regression problem with binary activation $$\sigma (x)=1_{\mathbb{R}^{+}}(x)$$ and square loss function $$\ell(v,w; Z): = \Big(v^\top \sigma(Z\, w) - (v^*)^\top \sigma(Z\, w^*)\Big)^2$$, where $$v^*\in\mathbb{R}^m$$ and $$w^*\in\mathbb{R}^n$$ are the underlying (nonzero) teacher parameters, and the entries of data matrix $$Z\in\mathbb{R}^{m\times n}$$ are identically distributed unit normal.

We show that the coarse gradient (using the derivative of either ReLU or clipped ReLU) on the population loss function $$E_Z \left[\ell(v,w; Z)\right]$$ forms an acute angle with the underlying true gradient [8, 9]. The negative coarse gradient is a descent direction that minimizes population loss so the corresponding iterates converge to a critical point. If the initialization obeys certain geometric conditions, convergence to the global minimizer $$(v^*,w^*)$$ holds. There is still much more to understand about coarse gradient descent.

Figure 2. Recognition of the spoken word “off” and associated runtimes in milliseconds on an Android app. 2a. Full-precision convolutional neural network (CNN) model. 2b. Binary weight CNN model with 2x speedup. Image courtesy of [6].

We have applied quantized CNNs to object detection and keyword spotting problems [6, 10]. The keyword CNN classifies an audio clip as either silence, an unknown word, or a word on a short list. The CNN has two convolutional layers — one fully-connected layer followed by the output class probabilities. A binary weight CNN $$(b_w=1)$$ maintains the full-precision network’s accuracy (93 percent) within one percent. Both are imported to an app on an Android cellular phone for runtime comparison. In Figure 2, a user says “off” and the app correctly recognizes the word and reports a runtime in milliseconds. The binary model doubles the speed on this particular hardware.

The redundancies of DNNs thus allow the development of complexity reduction methods—such as quantization—with minimal performance loss. Coarse gradient descent is useful to other non-convex, nonsmooth optimization problems as well.

References
[1] Cai, Z., He, X., Sun, J., & Vasconcelos, N. (2017). Deep Learning with Low Precision by Half-wave Gaussian Quantization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI.
[2] Courbariaux, M., Bengio Y., & David, J. (2015). BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. In NIPS’15: Proceedings of the 28th International Conference on Neural Information Processing Systems — Volume 2 (pp. 3123-3131). Montreal, Canada.
[3] Hinton, G. (2012). Neural Networks for Machine Learning. In Coursera video lectures. Retrieved from https://www.youtube.com/playlist?list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9.
[4] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., & Bengio, Y. (2018). Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. J. Mach. Lear. Res., 18, 1-30.
[5] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 512, 436-444.
[6] Sheen, S., & Lyu, J. (2019). Median Binary-Connect Method and a Binary Weight Convolutional Neural Network for Word Recognition. In 2019 IEEE Data Compression Conference. Snowbird, UT.
[7] Strang, G. (2018). The Functions of Deep Learning. SIAM News, 51(10), p. 1.
[8] Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., & Xin, J. (2019). Blended Coarse Gradient Descent for Full Quantization of Deep Neural Networks. Res. Math. Sci., 6(14). Preprint, arXiv:1808.05240.
[9] Yin, P., Zhang, S., Lyu, J., Osher, S., Qi, Y., & Xin, J. (2019). Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets. To be presented at the 2019 International Conference on Learning Representations. New Orleans, LA.
[10] Yin, P., Zhang, S., Qi, Y., & Xin, J. (2019). Quantization and Training of Low Bit-Width Convolutional Neural Networks for Object Detection. J. Comp. Math., 37(3), 349-360. Preprint, arXiv:1612.06052v2.
[11] Yu, D., & Deng, L. (2015). Automatic Speech Recognition: A Deep Learning Approach. In Signals and Communication Technology. New York, NY: Springer.

Jack Xin is a professor of mathematics at the University of California, Irvine. His research interests are in the analysis and computation of multiscale and data science problems.