SIAM News Blog

The Future of Deep Learning Will be Sparse

By Torsten Hoefler, Dan Alistarh, Nikoli Dryden, and Tal Ben-Nun

Deep learning continues to deliver surprising new capabilities for tasks such as image and object classification, game play, translation, and even molecular structure prediction, along with other significant advances in the computational sciences. These tasks are often carried out at human or superhuman performance levels, and one could argue that machines take over where human understanding of complex systems ends. The three key ingredients of deep neural networks (DNNs) are data, compute, and models; the latter includes the algorithms that design and train model structure and weights. In fact, the development of more capable deep learning systems is largely fueled by ever-larger models and amounts of data. OpenAI has documented exponential growth in the compute devoted to training the largest models, which grew by a factor of roughly 300,000 between 2012 and 2018, and present-day researchers continue this scaling trend with trillion-parameter models. Yet although today’s large models provide excellent performance, they are relatively inefficient and expensive to evaluate.

Over the last 10 years, both Moore’s law and innovations in high-performance computing—such as graphics processing unit (GPU) accelerators—have driven the first wave of progress in deep learning. With the end of Moore’s law now in sight, the “free” cost reductions in computation and storage to which we are accustomed will likely grind to a halt in the near future. However, biology may once again inspire algorithmic solutions, much as it did for the original neural networks of the 1960s. For instance, if we draw an analogy to biological brains—which are many orders of magnitude more energy efficient for similar tasks—we find that their connectivity is rather sparse. In fact, animal brains tend to become sparser as they grow in size. If we circle back to the aforementioned key ingredients, more data will remain available but more compute may break the bank. A growing body of research is therefore investigating approaches that engineer sparsity into deep learning models to fuel the next decade of success in artificial intelligence. Existing works indicate that sparsity can achieve speedups between 10x and 100x in the near future; even more may be possible at higher sparsity levels.

But what exactly is “sparsity” in the context of deep learning? This is a surprisingly complex topic, and more than 300 research papers have described different techniques that address several aspects of the following fundamental questions: What should we sparsify? How can we sparsify? When should we sparsify? How can we integrate sparsity into training?

Figure 1. An overview of sparsification approaches in deep learning. Figure courtesy of [2].

Sparsification comes in many shapes and forms, and we can broadly categorize the approaches into two classes: model sparsity and ephemeral sparsity. In model sparsity, the sparse structure is a property of the model itself; we can remove neurons, weights, or even whole substructures like filters or attention heads. In contrast, ephemeral sparsity is tied to the training or inference process and changes with each input example. This type of sparsity is well known outside the realm of performance in operators such as Dropout and rectified linear units, which temporarily drop connections and zero out activations, respectively. Figure 1 depicts an overview and coarse classification of the various sparsity approaches in use today.
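To make the distinction concrete, the following minimal NumPy sketch contrasts the two; the layer size, pruning threshold, and choice of a rectified linear unit are illustrative assumptions rather than a recommendation. Model sparsity zeroes weights once and for all, whereas ephemeral sparsity produces a different zero pattern for every input.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fully connected layer: 16 inputs -> 8 outputs.
W = rng.standard_normal((8, 16))
x = rng.standard_normal(16)

# Model sparsity: permanently remove small-magnitude weights.
# The mask is part of the model and identical for every input.
threshold = np.quantile(np.abs(W), 0.75)   # keep only the largest 25% of weights
weight_mask = np.abs(W) >= threshold
W_sparse = W * weight_mask

# Ephemeral sparsity: zeros that arise per input, e.g., from a rectified
# linear unit; a different input yields a different activation pattern.
y = np.maximum(W_sparse @ x, 0.0)

print("weight sparsity:", 1.0 - weight_mask.mean())
print("activation sparsity for this input:", np.mean(y == 0.0))
```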

While sparsity reduces the number of arithmetic operations to execute and weights to store, it incurs additional control and storage overheads to represent the sparse structure. In scientific computing, we would traditionally not even consider exploiting sparsity below 99 percent, because the overheads outweigh the savings. However, specialized architectures ensure that even 50 percent sparse workloads can yield performance benefits; NVIDIA’s Ampere microarchitecture can multiply matrices with 50 percent structured sparsity nearly twice as fast as dense ones. Today’s DNN models can be sparsified to between 50 and 95 percent without significant accuracy loss, and ephemeral sparsity may yield additional savings. Yet we still do not know whether block-wise sparsity, which requires significantly lower representation and control overheads, can deliver most of the benefits of fine-grained sparsity. Experiments show that the sparsity-accuracy tradeoff, which we capture as “parameter efficiency” [2], is less favorable for such coarse-grained structures, but the computational benefits may outweigh this loss. In practice, these benefits depend on both the problem and the target computer architecture.
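A rough back-of-the-envelope sketch shows why modest sparsity does not pay off without such hardware support. It assumes a plain compressed sparse row (CSR) layout with 32-bit values and indices, which is only one of many possible formats:

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Memory footprint of a dense matrix of 32-bit values."""
    return rows * cols * value_bytes

def csr_bytes(rows, cols, sparsity, value_bytes=4, index_bytes=4):
    """Approximate footprint of the same matrix in CSR format: one value and
    one column index per nonzero element, plus a row-pointer array."""
    nnz = int(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows, cols = 4096, 4096
for sparsity in (0.5, 0.9, 0.95, 0.99):
    ratio = csr_bytes(rows, cols, sparsity) / dense_bytes(rows, cols)
    print(f"sparsity {sparsity:.0%}: CSR needs {ratio:.2f}x the dense storage")
```

At 50 percent sparsity, the index overhead roughly cancels the savings; structured formats or dedicated hardware are therefore necessary to profit at such modest sparsity levels.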

Figure 2. Locally connected layers employ different weights for each region. Figure adapted from [1].

We must still examine many additional topics to gain a full picture of sparsity in deep learning. For example, how do we select elements for removal? Techniques range from simple “leave one out and check the quality” approaches and various saliency measures to learned gating functions, regularization, and selection schemes based on linear or quadratic models of the loss function. If we sparsify during training, we may need to regrow other elements after removal to maintain a balance; candidates can be regrown at random, according to the magnitude of the loss function’s gradient, or via preferential attachment rules that are inspired by the brain and power-law graphs. Ephemeral sparsity approaches can lead to substantial memory and communication savings, ultimately yielding significant speedups with the right runtime support for sparse parallel reductions [3]. Our full overview paper on this subject provides an extensive outline of sparse methods for deep learning [2].
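As a simplified sketch of one such scheme, the code below combines magnitude-based selection with random regrowth during training; the matrix size, sparsity level, and pruning fraction are arbitrary illustrative choices, not the specific methods surveyed in [2].

```python
import numpy as np

rng = np.random.default_rng(1)

def prune_and_regrow(W, mask, prune_frac=0.2):
    """One sparse-training update on weight matrix W with a binary mask:
    drop the smallest-magnitude active weights, then re-activate the same
    number of currently inactive positions at random."""
    active = np.flatnonzero(mask)
    k = int(prune_frac * active.size)
    if k == 0:
        return mask

    # Selection by saliency: here, plain weight magnitude.
    magnitudes = np.abs(W.ravel()[active])
    drop = active[np.argsort(magnitudes)[:k]]
    mask.ravel()[drop] = False

    # Regrowth: re-enable k random inactive positions so that the overall
    # sparsity level stays constant.
    inactive = np.flatnonzero(~mask.ravel())
    grow = rng.choice(inactive, size=k, replace=False)
    mask.ravel()[grow] = True
    return mask

W = rng.standard_normal((16, 16))
mask = rng.random(W.shape) < 0.5        # start at roughly 50 percent sparsity
mask = prune_and_regrow(W, mask)
print("sparsity after update:", 1.0 - mask.mean())
```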

DNNs are also beginning to gain popularity in many, if not most, scientific domains. Here, sparsity can be important even when exploring model options because it often leads to quality improvements in the low-sparsity regime. For example, post-processing of weather and climate data can take advantage of locally connected layers to correct for biases in numerical model predictions for different parts of the Earth [1]. In a locally connected layer, each output neuron is only connected to a spatially constrained set of input neurons, much like a stencil computation; unlike a fully connected layer, which connects each input neuron to every output neuron, it thus exploits the domain’s physically induced sparsity. The difference from the convolutional layers that researchers use for image recognition is that the weights are not shared across spatial points, so the layer can learn specific properties of each spatial region (see Figure 2). For example, the function that one applies on top of the ocean can differ from the function that one applies on top of a continent. We expect that such sparse techniques will become more relevant in deep learning for scientific computing and physical systems. As such, scientists should embrace them from the very beginning.
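The following minimal sketch of a one-dimensional locally connected layer makes the difference concrete; the window size and dimensions are illustrative assumptions, whereas the application in [1] operates on two-dimensional geographic grids.

```python
import numpy as np

rng = np.random.default_rng(2)

def locally_connected_1d(x, weights):
    """Each output position i applies its OWN weight vector to a local window
    of the input, unlike a convolution, which shares one kernel everywhere."""
    n_out, window = weights.shape
    return np.array([weights[i] @ x[i:i + window] for i in range(n_out)])

window = 3
x = rng.standard_normal(10)                     # values along a line of grid points
n_out = x.size - window + 1                     # "valid" output positions
W_local = rng.standard_normal((n_out, window))  # one weight vector per output position

y_local = locally_connected_1d(x, W_local)

# For comparison, a deep learning "convolution" shares one kernel across all
# positions; it is the special case in which every row of W_local is equal.
kernel = rng.standard_normal(window)
y_conv = np.convolve(x, kernel[::-1], mode="valid")  # cross-correlation, as in DNNs
print(y_local.shape, y_conv.shape)                   # both (8,)
```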

The future is bright for sparse deep learning; we will clearly begin to increase sparsity in the very near future, and most vendors are working on architectural support for their accelerators or software pipelines. Yet a multitude of challenges remain. How sparse can we go? Can we train with full sparsity? How can we understand sparsity’s power? Furthermore, early research indicates that sparse neural networks may amplify existing biases and compromise fairness. Despite these enduring open problems, sparsity in deep learning will surely rise in practical systems.


References
[1] Grönquist, P., Yao, C., Ben-Nun, T., Dryden, N., Dueben, P., Li, S., & Hoefler, T. (2021). Deep learning for post-processing ensemble weather forecasts. Phil. Trans. Roy. Soc. A., 379(2194).
[2] Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., & Peste, A. (2021). Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Preprint, arXiv:2102.00554.
[3] Renggli, C., Ashkboos, S., Aghagolzadeh, M., Alistarh, D., & Hoefler, T. (2019). SparCML: High-performance sparse communication for machine learning. In SC ’19: Proceedings of the international conference for high performance computing, networking, storage and analysis. New York, NY: Association for Computing Machinery.

Torsten Hoefler is a professor of computer science at ETH Zürich. His main research interests lie in large-scale supercomputing and networking, as well as deep learning systems. Dan Alistarh is an assistant professor at the Institute of Science and Technology Austria and the machine learning research lead at Neural Magic, Inc. His research focuses on efficient algorithms that enable scalable deep learning training and inference. Nikoli Dryden is an ETH Postdoctoral Fellow at ETH Zürich. His research focuses on high-performance deep learning and machine learning for science. Tal Ben-Nun is a postdoctoral fellow at ETH Zürich. His research interests include large-scale machine learning for scientific computing, learnable representations of code, and high-performance programming models.
