SIAM News Blog

The High-Performance Geometric Multigrid: An HPC Performance Benchmark

By Samuel Williams, Mark F. Adams, and Jed Brown

The High-Performance LINPACK (HPL) Benchmark is the most widely recognized metric for ranking high-performance computing (HPC) systems. It rose to prominence in the early 1990s, when its predicted ranking of a system correlated with the system’s efficacy for full-scale applications. Computer system vendors sought designs that would increase HPL performance, which in turn improved overall application behavior.

Unfortunately, this is no longer the case; in fact, the opposite is now true. Although the HPL Benchmark continues to be an effective proxy for applications based on dense linear algebra, it has lost its proficiency as a proxy for many applications relevant to the HPC community. Because those applications rely on work-optimal algorithms with high bandwidth and low latency requirements, HPL rankings of computer systems are no longer well correlated with real application performance. Motivated by HPL’s increasing inapplicability, the High-Performance Geometric Multigrid (HPGMG) incorporates machine sensitivities that correlate well with the sensitivities of HPC applications.

HPGMG complements both HPL and the new High-Performance Conjugate Gradients (HPCG) Benchmark [2], placing more stress on the memory system and network fabric than HPL and HPCG, respectively. The TOP500 list is currently adding new rankings for HPCG and HPGMG to complement the venerable HPL.

HPGMG Design Principles

The following design principles of HPGMG are discussed in the HPGMG 1.0 whitepaper [1]:

A benchmark must reflect improvements to computer systems that benefit our applications; such a benchmark is essential for documenting future improvements to HPC systems. The metric must be designed so that, as we optimize metric results for a particular platform, the changes also lead to performance improvements realized by real applications. Any new metric we introduce must satisfy a number of requirements:

  • Accurately reflect the characteristics of contemporary high-efficiency algorithms.
  • Accurately reflect the principal challenges of future hardware design — a balanced combination of memory bandwidth, interconnect performance (both for small and large messages), computational performance, and reliability. It should not be possible to “cheat” the benchmark by over-provisioning the hardware in any one of these areas. A machine designed for the sole purpose of performance on our metric should result in a “pretty good” machine for scientific and engineering applications.
  • The absolute improvements in this benchmark should ultimately be reflective of performance improvements realizable in real applications, which are occurring at a much slower rate than improvements in peak FLOPs.
  • It must be able to scale through many orders of magnitude improvement of hardware storage capacity and performance — much as HPL has for the past three decades.

No one benchmark can provide an accurate proxy of any particular application, but we believe that one comprehensive benchmark has two advantages over the alternative: a weighted set or bag of (simple) benchmarks.

The weighted bag of benchmarks does allow for explicit definition of machine metrics, measurement of those metrics in both the benchmarks and applications, and fitting of internal parameters to provide the best proxy for an application of interest. This approach can be more accurate than a single benchmark for a particular application or specific workload. However, one comprehensive benchmark, like HPL or HPGMG, is easier to administer, define, and adjudicate as a ranking metric.

Though we seek a rational approach—with modeling and measurements—for benchmark design, models cannot perfectly measure machine effectiveness. Therefore, a benchmark that implicitly demands an effective machine by solving a fundamentally hard problem is desirable. HPGMG has been designed with these principles in mind.

HPGMG Design

HPGMG uses the non-iterative form of multigrid—full multigrid—which requires \(O(N)\) flops, \(O(N)\) bytes from memory, \(O(N)\) bytes from cache, \(O(N^{2/3})\) MPI bytes, \(O(\log^2(N))\) messages, and \(O(\log^2(N))\) function calls/OMP overheads/CUDA kernels. For the fourth-order accurate finite-volume discretization of HPGMG (used for the ranking metric), this equates to about 1,200 flops and 1,200 bytes per degree of freedom (DOF) solved. As such, the arithmetic intensity (AI) is about 1.0 flop per byte. This is strongly memory-bound, as most machines have machine balances of five to 10 flops per byte; the benchmark thus incentivizes higher memory bandwidth.
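The memory-bound claim above follows directly from a roofline-style estimate. The sketch below uses the per-DOF costs quoted in this article (~1,200 flops and ~1,200 bytes); the node characteristics are illustrative assumptions, not measurements of any particular machine.

```python
# Roofline-style sketch of why an arithmetic intensity of ~1.0 flop/byte
# is memory-bound. Per-DOF costs are from the article; the node numbers
# below are hypothetical placeholders.

def attainable_gflops(peak_gflops, mem_bw_gbs, ai_flops_per_byte):
    """Classic roofline bound: min(peak compute, bandwidth x intensity)."""
    return min(peak_gflops, mem_bw_gbs * ai_flops_per_byte)

flops_per_dof = 1200.0
bytes_per_dof = 1200.0
ai = flops_per_dof / bytes_per_dof  # ~1.0 flop per byte

# Hypothetical node: 5 TF/s peak with 500 GB/s of DRAM bandwidth,
# i.e., a machine balance of 10 flops per byte.
peak_gflops, mem_bw_gbs = 5000.0, 500.0

print(attainable_gflops(peak_gflops, mem_bw_gbs, ai))  # 500.0
```

Because the attainable rate (500 GF/s) is only 10% of the hypothetical peak, adding more FLOPs would not help; only more memory bandwidth moves the bound, which is exactly the incentive the benchmark is designed to create.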

In practice (as observed with LIKWID), HPGMG’s L2 and L3 cache bandwidth demands are four times and two times higher than its DRAM bandwidth demand. This rewards a tapered cache hierarchy with progressively higher bandwidth closer to the FPUs; i.e., one cannot just attach an FPU to an HBM stack, but must exploit locality for bandwidth filtering. The number of messages is polylogarithmic in the problem size. As such, architects are forced to drive down overheads when scaling up machines to avoid squandering performance. This differentiates HPGMG from both HPL and HPCG, the latter of which sends \(O(1)\) messages per iteration (a truncated \(V\)-cycle). MPI injection bandwidth is likewise linked to problem size and thus memory bandwidth, but grows more slowly.
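The contrast between these scaling regimes is easy to see numerically. The sketch below plugs the asymptotic message counts quoted above into simple formulas; the leading constants are illustrative assumptions, not measured values.

```python
# Sketch: per-solve message counts under the asymptotic costs quoted in
# the article. Leading constants (c) are hypothetical placeholders.
import math

def hpgmg_messages(n_dofs, c=1.0):
    # Full multigrid sends O(log^2 N) messages per solve.
    return c * math.log2(n_dofs) ** 2

def truncated_vcycle_messages(n_dofs, c=1.0):
    # A truncated V-cycle (as in HPCG) sends O(1) messages per iteration,
    # independent of problem size.
    return c

for n in (2**20, 2**30, 2**40):
    print(n, hpgmg_messages(n), truncated_vcycle_messages(n))
```

Growing the problem a millionfold (from \(2^{20}\) to \(2^{40}\) DOF) only quadruples HPGMG’s message count, but unlike the constant-message case it never stops growing, so per-message latency and software overheads must shrink as machines scale up.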

The polylogarithmic nature of HPGMG also manifests in the overheads for function calls, OMP parallel regions, and device offloads. This incentivizes architects and software technologies to drive down overheads as they increase memory and compute capacity, lest they forgo the benefit of threading or acceleration.

We compile the HPGMG list of the world’s largest supercomputers twice a year—using equations solved per second as the metric—measured with a multigrid solve of a fourth-order accurate finite-volume discretization of the Laplacian. We published our first list at ISC High Performance 2016 and have continued releasing lists biannually. Table 1 depicts a selected and compressed sampling of the November 2017 list.

Table 1. A compressed sampling of the High-Performance Geometric Multigrid (HPGMG) list of the world’s largest supercomputers, as of November 2017.

We encourage community participation and invite comments and contributions to SWWilliams@lbl.gov. Visit the HPGMG webpage to learn more about the effort. Detailed analysis and current rank lists are also available online.

References
[1] Adams, M.F., Brown, J., Shalf, J., Van Straalen, B., Strohmaier, E., & Williams, S. (2014). HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems (LBNL Technical report, LBNL-6630E). Lawrence Berkeley National Laboratory.
[2] Dongarra, J., Heroux, M.A., & Luszczek, P. (2016). High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems. Int. J. High Perf. Comp. App., 30(1), 3-10.

Samuel Williams received his Ph.D. in computer science from the University of California, Berkeley in 2008, and is a staff scientist in the Computational Research Division at Lawrence Berkeley National Laboratory (LBNL). His work focuses on hardware/software solutions that affect performance, scalability, productivity, energy efficiency, and analytical capabilities on emerging multicore and accelerator-based supercomputers. Mark F. Adams received his Ph.D. in civil engineering from the University of California, Berkeley in 1998, and is currently a staff scientist at LBNL. He worked at Sandia National Laboratories and Columbia University before joining the LBNL staff in 2013. Jed Brown received his Dr.Sc. from ETH Zürich in 2011. He worked at Argonne National Laboratory until joining the University of Colorado Boulder as an assistant professor in 2015.
