January 29, 2018

The High-Performance Conjugate Gradients Benchmark

By Jack Dongarra, Michael A. Heroux, and Piotr Luszczek

The High-Performance LINPACK (HPL) Benchmark has been a measure of supercomputing performance for more than four decades, and the basis for the twice-yearly TOP500 list of the world’s fastest supercomputers for over 25 years. The benchmark is one of the most widely recognized and discussed metrics for ranking high-performance computing (HPC) systems. When HPL gained prominence as a performance metric in the early 1990s, there was a strong correlation between its predictions of system rankings and the ranking realized by full-scale applications. In these early years, computer system vendors pursued designs that would increase HPL performance, thus improving overall application performance.

Similarity of computations was partially responsible for this correlation. For example, frontal matrix solvers were commonly used in engineering applications and often consumed a large fraction of compute time. The computational and data access patterns of these solvers are similar to those of HPL. Moreover, memory system and floating-point computation performance were much more balanced. For example, the Cray Y-MP and C90 (1990s systems) could perform two reads and a write per clock cycle, enabling near-peak performance for writing the weighted sum of two vectors as another vector, the so-called AXPY operation. On today’s processors, that operation executes at about one to two percent of peak speed. HPL has a computational complexity of \(O(n^3)\) and a data access complexity of \(O(n^2)\), so simply running larger problems meant that its performance was minimally impacted by this growing imbalance between memory and floating-point performance. Many real applications have moved to more efficient algorithms with computational complexity closer to \(O(n)\) or \(O(n\log_2n)\) and similar data access complexity, and have realized a much smaller performance gain from computer system improvements. Even so, the net performance improvement in time to solution of new algorithms on new platforms has far exceeded HPL improvements. In contrast, time to solution for HPL is now measured in days, and is a serious concern for benchmarkers on leadership platforms.
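The contrast between AXPY and HPL can be made concrete with a rough count of floating-point operations per byte of memory traffic. The following Python sketch is illustrative only: the function names are hypothetical, values are assumed to be 8-byte doubles, the dense solve is assumed to cost the leading-order \(2n^3/3\) operations of an LU factorization, and the matrix is assumed to be moved through memory once at minimum.

```python
def axpy(alpha, x, y):
    """Return w = alpha * x + y (two reads and one write per element)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def axpy_intensity(bytes_per_value=8):
    """Floating-point operations per byte moved for one AXPY element."""
    flops = 2                      # one multiply and one add
    traffic = 3 * bytes_per_value  # read x[i], read y[i], write w[i]
    return flops / traffic


def dense_solve_intensity(n, bytes_per_value=8):
    """Leading-order operations per byte for an n-by-n dense LU solve."""
    flops = (2.0 / 3.0) * n ** 3        # assumed leading-order LU cost
    traffic = bytes_per_value * n ** 2  # the matrix, touched once at minimum
    return flops / traffic
```

Under these assumptions the AXPY ratio is a constant 1/12 operation per byte regardless of vector length, whereas the dense-solve ratio grows linearly with \(n\), which is why running larger HPL problems hides limited memory bandwidth.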

We expressed the following in [1]:

HPL remains tremendously valuable as a measure of historical trends and as a stress test, especially for the leadership class systems that are pushing the boundaries of current technology. Furthermore, HPL provides the HPC community with a valuable outreach tool, understandable to the outside world. Anyone with an appreciation for computing is impressed by the tremendous increases in performance that HPC systems have attained over the past few decades in terms of HPL. At the same time, HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations.

These tend to strictly demand high bandwidth and low latency as they possess the aforementioned lower computational complexity. In fact, we have reached a point where designing a supercomputer for good HPL performance can lead to design choices that are either ill-suited for the real application mix or add unnecessary components or complexity to the system. Left unchecked, we expect the gap between HPL predictions and real application performance to increase in the future.

Many aspects of the physical world are modeled with partial differential equations, which provide predictive capability and thus aid scientific discovery and engineering optimization. The High-Performance Conjugate Gradients (HPCG) Benchmark is a complement to the HPL Benchmark and now part of the TOP500 effort. It is designed to exercise computational and data access patterns that more closely match a different yet broad set of important applications, and to encourage computer system designers to invest in capabilities that will impact the collective performance of these applications.

We articulated the following ideas in [1]:

The setup phase [of HPCG] constructs a logically global, physically distributed sparse linear system using a 27-point stencil at each grid point in the 3D domain, such that the equation at point \((i,j,k)\) depends on the values of its location and 26 surrounding neighbors. The matrix is constructed to be weakly diagonally dominant for interior points of the global domain, and strongly diagonally dominant for boundary points, reflecting a synthetic conservation principle for the interior points and the impact of zero Dirichlet boundary values on the boundary equations. The resulting sparse linear system has the following properties:

A sparse matrix with 27 nonzero entries per row for interior equations and seven to 18 nonzero terms for boundary equations
A symmetric, positive definite, nonsingular linear operator
The boundary condition is reflected by subtracting one from the diagonal
A generated known exact solution vector with all values equal to one
A matching right-hand-side vector
An initial guess of all zeros.
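As an illustration of the setup just described, the following Python sketch assembles a small, single-process version of such a system. It is a sketch under stated assumptions rather than the benchmark's setup code: the function names are hypothetical, the diagonal value of 26 and off-diagonal value of -1 are assumptions chosen to give weak diagonal dominance at interior points, and neighbors that fall outside the domain are simply dropped to reflect the zero Dirichlet boundary values.

```python
def stencil_row(i, j, k, nx, ny, nz):
    """Return {column: value} for the equation at grid point (i, j, k)."""
    row = {}
    for dk in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for di in (-1, 0, 1):
                ii, jj, kk = i + di, j + dj, k + dk
                # Neighbors outside the domain hold the zero Dirichlet value
                # and are dropped from the equation.
                if 0 <= ii < nx and 0 <= jj < ny and 0 <= kk < nz:
                    col = ii + nx * (jj + ny * kk)
                    # Assumed values: 26 on the diagonal, -1 elsewhere.
                    row[col] = 26.0 if (di, dj, dk) == (0, 0, 0) else -1.0
    return row


def build_problem(nx, ny, nz):
    """Build the rows, the all-ones exact solution, the matching b, and x0."""
    rows = [stencil_row(i, j, k, nx, ny, nz)
            for k in range(nz) for j in range(ny) for i in range(nx)]
    x_exact = [1.0] * len(rows)
    # Right-hand side chosen so that A * x_exact == b.
    b = [sum(v * x_exact[c] for c, v in row.items()) for row in rows]
    x0 = [0.0] * len(rows)
    return rows, x_exact, b, x0
```

With these assumed values, an interior row has 27 entries that sum to zero, so its right-hand-side entry is zero, while a row on a face of the domain has 18 entries and a positive row sum.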

The central purpose of defining this sparse linear system is to provide a rich vehicle for executing a collection of important computational kernels. However, the benchmark is not about computing a high fidelity solution to this problem. In fact, iteration counts are fixed in the benchmark code and we do not expect convergence to the solution, regardless of problem size. We do use the spectral properties of both the problem and the preconditioned conjugate-gradient algorithm as part of software verification.

The HPCG reference code is complete, standalone, and derived from mini-applications developed in the Mantevo project. It measures the performance of basic operations in a unified code:

Sparse matrix-vector multiplication
Vector updates
Global dot products
Local symmetric Gauss-Seidel smoother
Sparse triangular solve (as part of the Gauss-Seidel smoother).
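Three of these kernels can be sketched in a few lines of Python. This is a serial illustration with hypothetical names and a hypothetical three-equation test matrix, not the reference code: rows are stored as dictionaries mapping column index to value, and the dot product here is purely local, whereas a distributed run would also need a reduction across processes.

```python
def spmv(rows, x):
    """Sparse matrix-vector product y = A x, with rows as {column: value}."""
    return [sum(v * x[c] for c, v in row.items()) for row in rows]


def dot(x, y):
    """Local dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def symgs(rows, b, x):
    """Apply one symmetric Gauss-Seidel step to x in place and return it.

    The forward sweep acts as a lower-triangular solve and the backward
    sweep as an upper-triangular solve.
    """
    n = len(rows)
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            off_diag = sum(v * x[c] for c, v in rows[i].items() if c != i)
            x[i] = (b[i] - off_diag) / rows[i][i]
    return x


# Hypothetical test system: a 1D operator with stencil (-1, 2, -1).
A = [{0: 2.0, 1: -1.0}, {0: -1.0, 1: 2.0, 2: -1.0}, {1: -1.0, 2: 2.0}]
b = spmv(A, [1.0, 1.0, 1.0])       # right-hand side for an all-ones solution
x = symgs(A, b, [0.0, 0.0, 0.0])   # one smoothing step from a zero guess
r = [bi - yi for bi, yi in zip(b, spmv(A, x))]
residual_norm = dot(r, r) ** 0.5
```

A single step reduces the residual without solving the system, which is the role a smoother plays inside a preconditioner.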

The code is also driven by a multigrid preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids. The reference implementation is written in C++ with MPI and OpenMP support.
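The overall structure of a preconditioned conjugate gradient iteration with a fixed iteration count can be sketched as follows. This Python sketch makes several assumptions that should be read as such: the names are hypothetical, the test operator is a small 1D stencil rather than the benchmark's 3D problem, and a simple Jacobi (diagonal) preconditioner stands in for the multigrid preconditioner, which is considerably more involved.

```python
def pcg(apply_a, apply_m_inv, b, x, iterations):
    """Run a fixed number of preconditioned CG iterations starting from x."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - ai for bi, ai in zip(b, apply_a(x))]  # initial residual
    z = apply_m_inv(r)                              # preconditioned residual
    p = list(z)                                     # first search direction
    rz = dot(r, z)
    for _ in range(iterations):
        if rz == 0.0:  # residual is exactly zero, nothing left to do
            break
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = apply_m_inv(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, dot(r, r) ** 0.5


def apply_a(v):
    """Hypothetical 1D operator with stencil (-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def jacobi(r):
    """Stand-in preconditioner: divide by the diagonal value 2."""
    return [ri / 2.0 for ri in r]


rhs = apply_a([1.0] * 4)  # right-hand side for an all-ones solution
x_one, res_one = pcg(apply_a, jacobi, rhs, [0.0] * 4, iterations=1)
x_four, res_four = pcg(apply_a, jacobi, rhs, [0.0] * 4, iterations=4)
```

On this tiny system four iterations are more than enough to reach the all-ones solution, while a single iteration leaves a nonzero residual. Fixing the iteration count means the amount of work is known in advance, independent of how close the iterate is to the solution.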

Computer system vendors have invested significant resources to optimize HPCG, including the adaptation of their math kernel libraries to provide optimized functionality that can benefit the broader communities using these libraries.

HPL follows the peak performance of the machine relatively closely, a fact that is well known to benchmarking practitioners and most HPC experts. The performance levels of HPCG are far below those seen with HPL. This should not be surprising to those in the high-end and supercomputing fields and is attributable to many factors, including the commonly cited “memory wall.”

HPCG has already been run on many large-scale supercomputing installations in China, Europe, Japan, and the U.S. (and off-planet in an orbiting satellite). The following chart shows the top 10 systems on the current HPCG Benchmark list, as of November 2017. A full list and more details are available online. HPCG results have been integrated into the TOP500 list.

High-Performance Conjugate Gradients (HPCG) Benchmark: Top 10 systems as of November 2017. The chart lists rank according to HPCG, computer location, computer name and specifications, core (processor) count, HPL performance, TOP500 rank, HPCG performance, and the fraction of the theoretical peak performance obtained for the HPCG Benchmark.

Acknowledgments: The authors thank the Department of Energy National Nuclear Security Administration for funding this work.

References
[1] Dongarra, J., Heroux, M.A., & Luszczek, P. (2016). High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems. Int. J. High Perf. Comp. App., 30(1), 3-10. http://journals.sagepub.com/doi/abs/10.1177/1094342015593158.

Jack Dongarra is a university distinguished professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, and a distinguished research staff member in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. Michael A. Heroux is a senior scientist at Sandia National Laboratories, director of S&W Technologies for the U.S. Department of Energy’s Exascale Computing Project, and scientist-in-residence at St. John’s University, Minn. His research interests include all aspects of scalable scientific and engineering software for new and emerging parallel computing architectures. Piotr Luszczek is a research assistant professor at the University of Tennessee and a research director at the university’s Innovative Computing Laboratory. His research interests are in performance engineering and numerical linear algebra. He teaches courses in these subjects at the graduate level and internationally at invited lectures.