# Reflections on the Royal Society’s “Numerical Algorithms for High-performance Computational Science” Discussion Meeting

The Royal Society discussion meeting "Numerical Algorithms for High-Performance Computational Science," organized by Jack Dongarra, Laura Grigori, and Nick Higham, took place at the Royal Society in London this past April. This two-day meeting promised much for those of us expected to provide algorithmic support for new software that is intended to model with ever greater fidelity and precision. Efficient numerical algorithms allow applications such as mine (pertaining to nuclear fusion experiments) to effectively exploit current computers, especially when the hardware is a high-performance computer that draws as much power as a small town.

Presenter David E. Keyes (King Abdullah University of Science and Technology) described algorithms as the neck of an hourglass: the critical component for overall device operation. Conceptually, the “neck” connects two large bulbs, which (instead of sand) respectively contain high-performance hardware and ultimate applications to scientific or engineering problems. I came away from this talk with the impression that the hourglass dimensions also correspond to the relative importance of funding sources, since many speakers’ slides strongly reflected the scientific application and/or the hardware of their sponsoring bodies. The slides have since been posted online and are a valuable resource for those interested in seeing how their application and/or chosen hardware fits into the current international scene, as they indicate what is or will become feasible in the sciences and machine learning over the next few years. Prompt publication in a special edition of the *Philosophical Transactions of the Royal Society A* is also expected.

I suspect that many of us would like an answer to the following question: “For what kind of machine should we be designing our algorithms?” But as we might have expected, none of the experts present at the meeting were prepared to commit in any detail. The consensus remains that we can only reach Exascale (i.e., machines capable of \(10^{18}\) calculations per second) by connecting vast numbers of chips that will run or “clock” at roughly present-day speeds, meaning that the need for communication between all of these chips will be a major limiting factor. Presenters hence emphasized two things: data compression techniques, typically built at base level on the numerical analyst’s friend, the singular value decomposition (SVD); and the use of reduced precision to represent numbers, down dramatically from the modern standard of 64 bits.
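To make the compression idea concrete, here is a minimal sketch of a rank-one approximation, the simplest truncated SVD. It is not taken from any of the talks. To stay within the Python standard library, it uses power iteration as a stand-in for a full SVD routine, and the matrix `A` and all function names are hypothetical choices for illustration.

```python
import math

def matvec(A, x):
    """Return A @ x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Return A.T @ y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(x):
    return math.sqrt(sum(t * t for t in x))

def leading_singular_triplet(A, iters=50):
    """Power iteration on A.T A: approximates the largest singular triplet."""
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n          # arbitrary normalized start vector
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v))
        nw = norm(w)
        v = [t / nw for t in w]
    Av = matvec(A, v)
    sigma = norm(Av)
    u = [t / sigma for t in Av]
    return u, sigma, v

def relative_error(A, u, sigma, v):
    """Frobenius-norm error of sigma * u v^T relative to A."""
    num = sum((A[i][j] - sigma * u[i] * v[j]) ** 2
              for i in range(len(A)) for j in range(len(A[0])))
    den = sum(a * a for row in A for a in row)
    return math.sqrt(num / den)

# Hypothetical data: an exactly rank-one matrix plus a small perturbation.
m, n = 6, 5
A = [[(i + 1) * (j + 1) + 0.01 * ((7 * i + 3 * j) % 5 - 2) for j in range(n)]
     for i in range(m)]

u, sigma, v = leading_singular_triplet(A)
rel_err = relative_error(A, u, sigma, v)
stored_full = m * n            # numbers needed for the full matrix
stored_compressed = m + n + 1  # numbers needed for u, v and sigma
print(f"relative error {rel_err:.2e}, storage {stored_compressed} vs {stored_full}")
```

For nearly low-rank data, sending `u`, `sigma`, and `v` instead of `A` cuts communication roughly from \(mn\) to \(m+n\) numbers, at the price of a controlled approximation error.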

The latter rings a few alarm bells for me, as I remember spending weeks in the 1980s porting code from 64-bit CRAY processors back to 32-bit desktops. My experience was that the results inevitably change and that these changes must be quantified, which makes deciding whether they matter a painfully open-ended process. Fortunately, modern work is going well beyond the simple rediscovery that a careful numerical analyst can solve many physics problems in 32-bit, as far more scope now exists to vary precision throughout a calculation.
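A toy illustration of how results shift between 64-bit and 32-bit arithmetic, assuming nothing about the codes I actually ported: the sketch below repeatedly adds 0.1 and emulates single precision by rounding through the standard `struct` module. The helper names are hypothetical.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest IEEE 754 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(n, single):
    """Add 0.1 to a running total n times, optionally rounding to 32 bits."""
    step = to_float32(0.1) if single else 0.1
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_float32(total)  # emulate a 32-bit accumulator
    return total

n = 100_000
exact = n / 10                         # 10000 in exact arithmetic
err64 = abs(accumulate(n, single=False) - exact)
err32 = abs(accumulate(n, single=True) - exact)
print(f"64-bit error: {err64:.3e}")
print(f"32-bit error: {err32:.3e}")
```

The 32-bit error is many orders of magnitude larger than the 64-bit one. Whether that matters depends entirely on the application, which is exactly the open-ended question described above.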

This intrigues me because so many applications use time-step adaptive solvers to solve ordinary differential equations (ODEs). Simply put, we can compare numerical solutions to ODEs computed by two schemes with different orders of accuracy, and vary the timestep according to their difference. In one important case of mine, a symmetry in the physical problem means that the leading nonlinear term for evolution of amplitude \(a\) is cubic rather than quadratic, i.e.,

\[\ddot{a} = -dV/da = \alpha a- \epsilon a^3,\]

where \(\alpha \gg \epsilon\) are both positive constants. I find that over an oscillation of unit amplitude (even for small \(|a|\), where the linear term is dominant) one must accurately account for the cubic term, or significant failure of conservation of energy \(\tfrac{1}{2}\dot{a}^2+V(a)\) can occur. Since I might reasonably expect not to need the same accuracy elsewhere, adaptive precision should reduce my computational costs. Yet even at present, practically important details (such as precisely when to change timestep size, and by how much) make for relatively complicated code, so I see a challenge in simultaneously varying the precision.
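The step-size control described above can be sketched as follows. This is a minimal illustration rather than my production solver: it pairs a first-order Euler step with a second-order Heun step, uses their difference as the error estimate, and monitors the energy \(\tfrac{1}{2}\dot{a}^2 + V(a)\) with \(V(a) = -\tfrac{1}{2}\alpha a^2 + \tfrac{1}{4}\epsilon a^4\). The parameter values, tolerance, and safety factors are assumptions chosen only for the demonstration.

```python
import math

def rhs(a, b, alpha, eps):
    """First-order form of a'' = alpha*a - eps*a**3, with b = a'."""
    return b, alpha * a - eps * a ** 3

def energy(a, b, alpha=1.0, eps=0.1):
    """Conserved quantity b^2/2 + V(a), where -dV/da = alpha*a - eps*a^3."""
    return 0.5 * b * b - 0.5 * alpha * a * a + 0.25 * eps * a ** 4

def integrate(a, b, t_end, alpha=1.0, eps=0.1, tol=1e-6, h=0.01):
    """Adaptive Euler/Heun pair; returns final state and step counts."""
    t, accepted, rejected = 0.0, 0, 0
    while t < t_end:
        h = min(h, t_end - t)                  # do not step past t_end
        k1a, k1b = rhs(a, b, alpha, eps)
        ea, eb = a + h * k1a, b + h * k1b      # first-order (Euler) result
        k2a, k2b = rhs(ea, eb, alpha, eps)
        ha = a + 0.5 * h * (k1a + k2a)         # second-order (Heun) result
        hb = b + 0.5 * h * (k1b + k2b)
        err = max(abs(ha - ea), abs(hb - eb))  # difference of the two schemes
        if err <= tol:
            t, a, b = t + h, ha, hb            # accept the more accurate result
            accepted += 1
        else:
            rejected += 1                      # retry from the same state
        # The estimate scales like h^2, so rescale h by sqrt(tol/err),
        # with a safety factor and limits on how fast h may change.
        factor = 0.9 * math.sqrt(tol / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, factor))
    return a, b, accepted, rejected

a0, b0 = 1.0, 0.0
a1, b1, accepted, rejected = integrate(a0, b0, t_end=10.0)
drift = abs(energy(a1, b1) - energy(a0, b0))
print(f"{accepted} accepted, {rejected} rejected steps, energy drift {drift:.2e}")
```

Even this toy version needs an acceptance test, a safety factor, and growth limits. Adding a second adaptive dimension for precision would multiply those decisions.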

More generally, speakers also wanted to use reduced precision to increase the nominal speed of floating-point calculations on each processor. I was intrigued to see suggestions that took this to the extreme and proposed that single-bit precision might be adequate for certain machine learning tasks. However, such algorithms would be deeply wrapped as far as the ultimate user is concerned, an approach that my aforementioned ODE experience also commends. Similarly, hiding the use of data compression from the non-specialist (illustrated via an application to make efficient Exascale solvers for linear algebra) seems like an excellent idea.

In terms of applications, one suggested value of Exascale is the ability to perform multiscale and/or multiphysics calculations. No inherent communications problems seem to exist with the former, since it involves an implicit compression going up the hierarchy of increasing scale, but multiphysics might well imply the transmission of entire data arrays. One of my applications would need to couple an array representing the magnetic field from a Maxwell’s equations solver to determine forces on a plasma, modelled using a separate Navier-Stokes fluid equations solver. I saw no “magic bullet” for this coupling, only perhaps the obvious: the more closely the fields have to be linked, the more their solvers should share the same data structures and software libraries. This becomes even more true if data movement is further reduced by processing data “in situ,” i.e., without moving it from the chip where it is generated.

Sample applications highlighted the need to develop algorithms involving hash tables to index big data, and directed acyclic graphs (DAGs) to establish whether one calculation depends on already knowing the results of another. More than one presenter suggested that we should engage with the manufacturers if performance is critical for our applications, so that the hardware and software can be developed together to ultimately achieve Exascale levels of performance.
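As a small illustration of the DAG idea, the sketch below orders a set of calculations so that each runs only after the results it depends on are available, using Kahn's algorithm for topological sorting. The task names are hypothetical, loosely modelled on the field-and-fluid coupling mentioned earlier, and a real runtime would also distribute the ready tasks across processors.

```python
from collections import deque

def schedule(deps):
    """Order tasks so each follows its dependencies (Kahn's algorithm).

    deps maps a task name to the set of task names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    tasks = set(deps)
    for needed in deps.values():
        tasks |= set(needed)
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, needed in remaining.items():
        for d in needed:
            dependents[d].append(t)

    # Tasks with no outstanding dependencies could all run concurrently.
    ready = deque(sorted(t for t, needed in remaining.items() if not needed))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for u in sorted(dependents[t]):
            remaining[u].discard(t)
            if not remaining[u]:
                ready.append(u)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle: not a DAG")
    return order

# Hypothetical task graph for one coupled time step.
deps = {
    "solve_maxwell": {"read_state"},
    "compute_forces": {"solve_maxwell"},
    "solve_fluid": {"compute_forces", "read_state"},
    "write_output": {"solve_fluid"},
}
order = schedule(deps)
print(" -> ".join(order))
```

The same graph that fixes a valid ordering also exposes which calculations are independent of one another, and hence which can proceed without waiting on communication.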

Indeed, the “Numerical Algorithms for High-performance Computational Science” discussion meeting reinforced my preconceptions that I should aim to reshape my algorithms around a relatively restricted range of libraries for Exascale, while also preparing to become more deeply involved at levels (like DAGs) that I might have previously regarded as the province of compiler writers.

**Acknowledgments:** This work was funded by the RCUK Energy Programme and the European Communities, under the contract of an association between EURATOM and CCFE.