How We Can All Play the High-performance Game: From Laptops to Supercomputers

By Michael Bader, Hatem Ltaief, Lois Curfman McInnes, and Rio Yokota

The changing landscape of heterogeneous computing offers both unprecedented power and unprecedented complexity [7]. Given the associated challenges and opportunities, we discuss the necessity of international collaboration to develop open-source portable scientific software ecosystems, including the required functionality for advanced scientific computing. Such collaboration will enable all members of our community to fully exploit the computational power of hybrid machines—from laptops and desktops to clusters, supercomputers, and beyond—and advance scientific discovery.

In a recent letter to the editor in SIAM News, Nick Trefethen called attention to the divergence in computing performance on high-end supercomputers versus everyday laptop and desktop machines [12]. His article raises important considerations about the evolving landscape of computer architectures and implications for the future of computational science and high-performance computing (HPC). As current officers of the SIAM Activity Group on Supercomputing (SIAG/SC), we respond by considering heterogeneous computing architectures that are now readily available at all scales of computing, including on cell phones, laptops, desktops, and clusters. Heterogeneous designs—typically combinations of both central processing units (CPUs) and graphics processing units (GPUs) [11]—offer the potential of unprecedented computing power in a given computing environment (say, on a desktop); however, their extraordinary complexity means that most people cannot fully take advantage of that performance in their own local computing environments.

Trefethen points out that the trend in computational power of the average desktop machine (shown by the solid pink line in Figure 1) has advanced several orders of magnitude less than that of the largest machines on the TOP500 list. Indeed, most typical users of laptops and desktops readily write CPU-only code and achieve the level of performance in the figure.

Architectural Trends: Processors are Getting Faster, But Only With Parallel Codes

Figure 1. Performance trends for regular desktops (solid pink line) and accelerated desktops (solid green line). A fixed budget no longer supports this trend. An NVIDIA H100 graphics processing unit costs roughly $30,000; when we talk about performance, we are probably assuming that the system is actually a high-end server. Figure augmented from a graph that appeared in [12], which was augmented from a TOP500 Project graph.

In the quest towards petascale and exascale computing, energy efficiency requirements have forced architectures to extreme parallelism and complicated memory hierarchies. Even mainstream laptops and desktops are now parallel hybrid architectures, including both CPUs and GPUs. Figure 1 demonstrates that when we fully leverage modern CPUs and GPUs, the growth of laptop and desktop computing power actually aligns with the growth of extreme-scale computing. However, the traditional approach to computing—which only employs the simple-to-use CPUs and omits the more difficult GPUs—leaves behind an enormous fraction of untapped potential in the machines. A huge challenge therefore remains: Given the architectural complexity of these hybrid systems and the limited performance-portable, open-source software that is tailored for these new architectures, how can we all play the HPC game?

Extracting the Full Potential of CPUs is Getting Harder

The high-end Intel Xeon Platinum 9282 has a theoretical peak of 9.3 TFlop/s (see Figure 1). This performance can be achieved only if the code can utilize the 56 cores and AVX-512 fused multiply-add instructions with a single instruction, multiple data vector width of eight doubles. If the code is not vectorized, users get 1/8 of the performance; if it is not multithreaded, they get another 1/56 of that. In practice, many issues—such as the algorithm’s arithmetic intensity and the spatial/temporal locality of the implementation—further reduce performance. Sole reliance on compilers to automatically solve all of these issues has been unsuccessful. Furthermore, porting code to GPUs is not the only challenge; both CPUs and GPUs may suffer from inadequate numerical algorithms and software ecosystems.
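The penalty arithmetic in the preceding paragraph can be made concrete with a short sketch. It uses only the figures quoted above (9.3 TFlop/s peak, 56 cores, a vector width of eight doubles); the function name `attainable_peak` is a hypothetical helper, and the result is an upper bound that ignores the arithmetic intensity and locality effects mentioned in the text.

```python
# Upper bound on attainable performance for the CPU described in the text.
# All three constants are taken from the paragraph above.
PEAK_TFLOPS = 9.3   # theoretical peak, fully vectorized and multithreaded
SIMD_WIDTH = 8      # doubles per AVX-512 vector
CORES = 56          # cores that must all be kept busy


def attainable_peak(vectorized: bool, multithreaded: bool) -> float:
    """Return the peak (TFlop/s) that remains after each missed optimization."""
    peak = PEAK_TFLOPS
    if not vectorized:
        peak /= SIMD_WIDTH   # scalar code uses one of eight vector lanes
    if not multithreaded:
        peak /= CORES        # serial code uses one of 56 cores
    return peak


for vec in (True, False):
    for mt in (True, False):
        tflops = attainable_peak(vec, mt)
        share = 100.0 * tflops / PEAK_TFLOPS
        print(f"vectorized={vec!s:5} multithreaded={mt!s:5} "
              f"{tflops * 1000:9.1f} GFlop/s ({share:6.2f}% of peak)")
```

Under these assumptions, scalar single-threaded code is limited to 9.3/448 TFlop/s, which is about 20.8 GFlop/s or roughly 0.2 percent of the theoretical peak.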

Deep Learning Inference Drives Hardware Architecture Design

Deep learning applications can tolerate extremely low levels of precision, and GPU vendors are taking advantage of this fact. While NVIDIA H100 GPUs achieve 134 TFlop/s in FP64 with Tensor Cores, they have a theoretical peak of about 2 PFlop/s for code that can fully utilize the FP16/BF16 sparse Tensor Cores, and close to 4 PFlop/s for code that can use FP8/INT8. Another bifurcation will evolve here if the scientific computing community does not exploit this extra computing power.
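To illustrate the trade-off behind these numbers, the following sketch compares the peak rates quoted above with the machine epsilon of each floating-point format. It is only an illustration: rounding is emulated on the CPU with Python's `struct` module, the helper names are hypothetical, and FP8 is omitted because `struct` has no 8-bit floating-point format.

```python
import struct

# Peak rates as quoted in the paragraph above (TFlop/s). The FP16 figure is
# the sparse Tensor Core peak of "about 2 PFlop/s".
PEAK_TFLOPS = {"FP64": 134.0, "FP16": 2000.0}

# struct format codes: 'd' = 64-bit, 'f' = 32-bit, 'e' = 16-bit IEEE floats.
FORMATS = {"FP64": "d", "FP32": "f", "FP16": "e"}


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the given storage format and back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def machine_epsilon(fmt: str) -> float:
    """Gap between 1.0 and the next larger number representable in fmt."""
    eps = 1.0
    while round_to(fmt, 1.0 + eps / 2) != 1.0:
        eps /= 2
    return eps


for name, fmt in FORMATS.items():
    print(f"{name}: machine epsilon = {machine_epsilon(fmt):.3e}")

speedup = PEAK_TFLOPS["FP16"] / PEAK_TFLOPS["FP64"]
print(f"FP16 peak / FP64 peak = {speedup:.1f}x")
```

With the quoted figures, FP16 offers roughly 15 times the FP64 peak, but its machine epsilon is 2^-10 (about three decimal digits) compared with 2^-52 for FP64. That gap is what mixed-precision algorithms must compensate for.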

Techniques exist to recover double-precision accuracy via low-precision Tensor Cores at the matrix multiplication level [9, 10] and the solver level [1, 4]. But at the moment, such techniques are primarily exploited for certain dense/sparse linear algebra operations. In order for a wider range of scientific applications to benefit from these architectures, we need cross-cutting research in three key areas: (i) theory and guidelines to profit from mixed precision while maintaining the fidelity of results [6], (ii) simulation software to improve performance portability, and (iii) robust and trustworthy software ecosystems to encapsulate functionality [8].
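As a rough illustration of the solver-level idea, the sketch below applies classical mixed-precision iterative refinement to a small linear system: the expensive solve runs in emulated single precision, while residuals and updates are kept in double precision. This is a toy under stated assumptions, not the Tensor Core implementations of [1, 4]. Single precision is emulated by rounding through `struct`, all function names are hypothetical, and the elimination is repeated in every iteration, whereas a practical code would factor the matrix once and reuse the factors.

```python
import struct


def f32(x: float) -> float:
    """Round a double to the nearest single-precision value (emulated)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low(A, b):
    """Gaussian elimination with partial pivoting, rounding every result to FP32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def refine(A, b, iters=5):
    """Low-precision solve followed by double-precision residual corrections."""
    x = solve_low(A, b)
    for _ in range(iters):
        # Residual r = b - A x, computed in double precision.
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = solve_low(A, r)                      # correction in low precision
        x = [xi + di for xi, di in zip(x, d)]    # update in double precision
    return x


def max_error(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


# Small, well-conditioned test system with a known solution.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x_true = [1.0 / 3.0, -2.0 / 7.0, 0.1]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]

print("error, low precision only:", max_error(solve_low(A, b), x_true))
print("error, after refinement:  ", max_error(refine(A, b), x_true))
```

For a well-conditioned system such as this one, the refined solution reaches roughly double-precision accuracy even though every solve is carried out in single precision. The approach is not expected to converge when the matrix is too ill-conditioned for the low-precision solve.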

The Deep Learning Community Benefits From Algorithmic Uniformity

Scientific computing no longer drives supercomputing architectures. GPUs were originally developed to solve a few well-defined tasks in the realm of games and interactive graphics computing. Now they serve the deep learning community by championing the forward and backward propagation of neural networks. This circumstance allows hardware manufacturers to aggressively optimize their architectures for a very narrow set of algorithms. Because of energy and cost constraints, processors can no longer simultaneously be general purpose and high performance. Neural network architectures have coevolved with GPUs; transformers (types of deep learning models) can now extract a large portion of the Tensor Core’s theoretical peak Flop/s.

Architectural Features Drive Progress in High-performance Scientific Computing

The diversity of algorithms that are needed to address the broad range of scientific computing applications complicates the unified coevolution approach (see Figure 2). Does this mean that the scientific computing community will not be able to fully embrace the hardware evolution? We must redesign current algorithms, devise new algorithms, and refactor and redesign code to fully leverage the fast, low-precision hardware units and other features of heterogeneous architectures for scientific computing. A variety of groups within the international community have been pushing towards exascale computing with advances in algorithms and software [2, 3]. Because research gains that are driven by these extreme-scale systems tackle the on-node functionality that manifests throughout all scales of computing, this work also provides a foundation for performance improvement across desktops and clusters.

Figure 2. Applications in scientific computing broadly impact science and society. For example, teams in the U.S. Department of Energy’s Exascale Computing Project (ECP) are devising novel algorithms that are encapsulated in reusable software technologies to advance discovery in chemistry, materials, energy, Earth and space science, data analytics, optimization, artificial intelligence, and more [5]. Figure courtesy of ECP applications teams led by Andrew Siegel and Erik Draeger.

An Urgent Need for New Research and Scientific Software Ecosystems

The scientific computing community faces an urgent need for research on numerical algorithms that exploit the features of heterogeneous architectures, including mixed precision (wherever applicable), complex memory hierarchies, and massive parallelism. Equally important is the development of open-source community software ecosystems that encapsulate this functionality for ready use by everyone. A combined approach of algorithmic advances and international collaboration on robust and trustworthy scientific software ecosystems will enable us to counter bifurcation and fully exploit the computing power of new heterogeneous architectures at all scales of computing.

Get Involved

Addressing the challenges of next-generation algorithms and software requires a wide range of expertise from the international scientific computing community, including members of multiple SIAGs that address relevant topics: SIAG/SC, SIAG on Computational Science and Engineering, SIAG on Data Science, SIAG on Linear Algebra, SIAG on Applied Mathematics Education, and SIAG on Equity, Diversity, and Inclusion. Also valuable are topical groups that focus on imaging science, life sciences, and geosciences, as well as partnerships with the IEEE Computer Society’s Technical Community on Parallel Processing and the Association for Computing Machinery’s Special Interest Group on High Performance Computing.

Interested in high-performance scientific computing? We encourage you to join SIAG/SC to engage with a vibrant community of scientists who consider a broad range of topics in numerical algorithms and computer architectures that directly contribute to high-performance computer systems. SIAG/SC promotes the exchange of ideas by focusing on the interplay of analytical methods, numerical analysis, and efficient computation.

Interested readers should consider attending the 2024 SIAM Conference on Parallel Processing for Scientific Computing (PP24), which will take place from March 5-8, 2024, in Baltimore, Md. Featuring an exciting program that highlights the newest advances in the field, PP24 is a great opportunity for attendees to learn about and discuss the latest happenings in high-performance scientific computing. Please join us!

A slightly amended version of this article was published on the SIAG/SC website on November 8th.

References
[1] Abdelfattah, A., Anzt, H., Boman, E.G., Carson, E., Cojean, T., Dongarra, J., … Yang, U.M. (2021). A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. Int. J. High Perform. Comput. Appl., 35(4), 344-369.
[2] Alam, S.R., McInnes, L.C., & Nakajima, K. (2022). IEEE special issue on innovative R&D toward the exascale era. IEEE Trans. Parallel Distrib. Syst., 33(4), 736-738.
[3] Anzt, H., Boman, E., Falgout, R., Ghysels, P., Heroux, M., Li, X., … Yang, U.M. (2020). Preparing sparse solvers for exascale computing. Phil. Trans. R. Soc. A: Math. Phys. Eng. Sci., 378(2166), 20190053.
[4] Haidar, A., Bayraktar, H., Tomov, S., Dongarra, J., & Higham, N.J. (2020). Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems. Proc. R. Soc. A: Math. Phys. Eng. Sci., 476(2243), 20200110.
[5] Kothe, D., Lee, S., & Qualters, I. (2019). Exascale computing in the United States. IEEE Comput. Sci. Eng., 21(1), 17-29.
[6] Ltaief, H., Genton, M.G., Gratadour, D., Keyes, D.E., & Ravasi, M. (2022). Responsibly reckless matrix algorithms for HPC scientific applications. Comput. Sci. Eng., 24(4), 12-22.
[7] Matsuoka, S., Domke, J., Wahib, M., Drozd, A., & Hoefler, T. (2023). Myths and legends in high-performance computing. Int. J. High Perform. Comput. Appl., 37(3-4), 245-259.
[8] McInnes, L.C., Heroux, M.A., Draeger, E.W., Siegel, A., Coghlan, S., & Antypas, K. (2021). How community software ecosystems can unlock the potential of exascale computing. Nat. Comput. Sci., 1, 92-94.
[9] Ootomo, H., Ozaki, K., & Yokota, R. (2023). DGEMM on integer matrix multiplication unit. Preprint, arXiv:2306.11975.
[10] Ootomo, H., & Yokota, R. (2022). Recovering single precision accuracy from tensor cores while surpassing the FP32 theoretical peak performance. Int. J. High Perform. Comput. Appl., 36(4), 475-491.
[11] Rosso, M., & Myers, A. (2021). A gentle introduction to GPU programming. Better Scientific Software Blog. Retrieved from https://bssw.io/blog_posts/a-gentle-introduction-to-gpu-programming.
[12] Trefethen, N. (2023, September 1). A bifurcation in Moore’s law? SIAM News, 56(7), p. 2.

Michael Bader is an associate professor in the TUM School of Computation, Information and Technology at the Technical University of Munich, where he works on hardware-aware algorithms and software for high-performance computing. He served as program director of the SIAM Activity Group on Supercomputing (SIAG/SC) from 2022-2023 and is co-chair of the 2024 SIAM Conference on Parallel Processing for Scientific Computing.

Hatem Ltaief is a principal research scientist in the Extreme Computing Research Center at King Abdullah University of Science and Technology (KAUST). His research interests include parallel numerical algorithms and performance optimizations for manycore architectures. Ltaief served as vice chair of SIAG/SC from 2022-2023 and leads SIAG/SC outreach activities.

Lois Curfman McInnes is a senior computational scientist and Argonne Distinguished Fellow in the Mathematics and Computer Science Division at Argonne National Laboratory. Her work focuses on high-performance scientific computing, with an emphasis on scalable numerical libraries and community collaboration towards productive and sustainable software ecosystems. She served as chair of SIAG/SC from 2022-2023.

Rio Yokota is a professor in the Global Scientific Information and Computing Center at the Tokyo Institute of Technology. His research interests lie at the intersection of high-performance computing and machine learning. Yokota served as secretary of SIAG/SC from 2022-2023 and leads the SIAG/SC Committee, whose members help plan various SIAG/SC initiatives.