# State-of-the-Art FFT: Algorithms, Implementations, and Applications

#### Reflections on the 2018 SIAM Conference on Parallel Processing for Scientific Computing

The fast Fourier transform (FFT) is ubiquitous in many areas of science and engineering. A two-part minisymposium at the 2018 SIAM Conference on Parallel Processing for Scientific Computing, which took place this March in Tokyo, was the second in a planned series of meetings to foster discussion about the FFT. These discussions are meant to produce a widely-accepted FFT benchmark suite for supercomputer co-design. All meeting material is available online.

The speakers in this minisymposium reported on work carried out in Japan and elsewhere to address high-performance and parallel implementations of the FFT; mathematical encapsulations of FFT algorithms that are amenable to automatic implementations tuned to hardware platforms; and results of the FFT on current and emerging platforms, such as many-core processors, graphics processing units (GPUs), and distributed memory systems.

Daisuke Takahashi (University of Tsukuba) opened the first session with a talk about the implementation of parallel FFTs on an Intel Xeon Phi processor cluster. He reminded the audience that parallel FFTs on distributed-memory parallel computers require intensive all-to-all communication, which limits performance. To mitigate this issue and improve the performance of a one-dimensional FFT with automatic tuning, Takahashi overlapped computation with communication, an approach that had not previously been reported on an Intel Xeon Phi processor cluster. He used automatic tuning to select the number of divisions for the computation-communication overlap. Performance results demonstrated that the proposed implementation effectively improves performance on such a cluster.
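The transpose step that makes this all-to-all communication necessary can be illustrated in a few lines. The sketch below (a serial NumPy illustration, not Takahashi's tuned code) computes a one-dimensional FFT with the classic four-step, transpose-based decomposition; on a distributed-memory machine, the transpose becomes the all-to-all exchange that overlap techniques try to hide.

```python
import numpy as np

def four_step_fft(x, rows, cols):
    """1-D FFT of length rows*cols via the four-step (transpose) algorithm.

    On a distributed-memory machine the transpose becomes an all-to-all
    exchange; splitting it into chunks lets communication proceed while
    FFTs run on other chunks (the idea behind computation-communication
    overlap).
    """
    n = rows * cols
    a = np.asarray(x, dtype=complex).reshape(rows, cols)
    # Step 1: FFTs along the columns
    a = np.fft.fft(a, axis=0)
    # Step 2: twiddle-factor multiplication
    i = np.arange(rows).reshape(rows, 1)
    j = np.arange(cols).reshape(1, cols)
    a *= np.exp(-2j * np.pi * i * j / n)
    # Step 3: FFTs along the rows
    a = np.fft.fft(a, axis=1)
    # Step 4: transpose (the all-to-all step on distributed memory)
    return a.T.reshape(n)

x = np.arange(24.0)
X = four_step_fft(x, rows=4, cols=6)  # agrees with np.fft.fft(x)
```

The "number of divisions" selected by Takahashi's autotuner corresponds to how finely the transpose in step 4 is chunked so that each chunk's communication overlaps with computation on the others.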

Franz Franchetti (Carnegie Mellon University) continued with a discussion of SPIRAL, an FFT library-generation and autotuning framework. It aims for good performance on new computer architectures without requiring performance optimization by the software developer. Franchetti explained the ideas behind SPIRAL FFT and showed results that outperform both the Fastest Fourier Transform in the West (FFTW) and the Intel Math Kernel Library by a factor of two. He suggested that the community ask, “What corresponds to the LAPACK interface for linear algebra in the spectral space? How can LAPACK's success be repeated for FFTs?”

Next, Jun Doi (IBM Research – Tokyo) spoke about pipelining the fast Fourier transform on the OpenPOWER system. Utilizing the NVLink interconnect between processors, and between GPUs and central processing units (CPUs), in an OpenPOWER cluster considerably accelerates the array transpose in the FFT calculation. Researchers have implemented several techniques for the three-dimensional FFT, such as unified memory, CUDA FFT (cuFFTXt), and GPU memory blocking; cuFFTXt outperforms the unified-memory and memory-blocking techniques. A large performance improvement (double the cuFFTXt performance) is obtainable via prefetch, i.e., helper functions for unified memory. All-to-all pipelining, in which each array is divided into smaller arrays that form pipelines and allow overlapping within a very small time window, achieved even better performance. Future work will involve evaluation of real applications and of performance across nodes.

Akira Nukada (Tokyo Institute of Technology), the final speaker in the first session, described his approach for generating an FFT kernel for a given transform size. He implemented an auto-tuning FFT library for CUDA GPUs, generating CUDA PTX code for one-dimensional FFTs, and tested it on NVIDIA GPUs to improve performance for transform sizes that are not a power of two. PTX generation is fast in terms of both memory use and compilation time. Nukada displayed a useful table comparing the features and bottlenecks of AMD Radeon, NVIDIA GeForce, and NVIDIA Tesla C2050 cards, noting that double precision is a bottleneck on GeForce.
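One common way libraries handle transform sizes that are not a power of two is Bluestein's chirp-z algorithm, which re-expresses an arbitrary-length DFT as a convolution evaluated with power-of-two FFTs. The NumPy sketch below illustrates the idea; it is a minimal reference version, not Nukada's generated PTX kernels.

```python
import numpy as np

def bluestein_fft(x):
    """DFT of arbitrary length n via Bluestein's chirp-z trick.

    Uses mk = (m^2 + k^2 - (k - m)^2) / 2 to turn the DFT into a
    convolution, then evaluates that convolution with FFTs of a
    convenient power-of-two length m >= 2n - 1.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    k = np.arange(n)
    w = np.exp(-1j * np.pi * k * k / n)       # chirp: w[k] = e^{-i*pi*k^2/n}
    m = 1 << (2 * n - 2).bit_length()          # power of two >= 2n - 1
    a = np.zeros(m, dtype=complex)
    a[:n] = x * w                              # pre-chirped input
    b = np.zeros(m, dtype=complex)
    b[:n] = np.conj(w)                         # chirp filter, b[j] = b[-j]
    b[m - n + 1:] = np.conj(w[1:][::-1])       # wrap negative indices
    # circular convolution via power-of-two FFTs
    conv = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
    return w * conv[:n]
```

Because the inner FFTs always run at a padded power-of-two length, the awkward prime or mixed-radix size never reaches the performance-critical kernels, at the cost of roughly a constant factor in work.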

Samar Aseeri (King Abdullah University of Science and Technology) began the second session with a broad introduction to FFTs. She started with the mathematical concepts, moved on to derivations and computations, and eventually reached the efficient algorithmic formulation that has become essential and widely used in scientific applications. This included an overview of existing FFT algorithms, methods (Fourier, spectral), and software packages. After presentation at a few more venues, this material is intended for XSEDE broadcast tutorials about software libraries.

Benson Muite (University of Tartu) described use of the FFT to solve a number of equations, such as Gray-Scott, KdV-Burgers, three-dimensional Navier-Stokes, and the real cubic Klein-Gordon, and discussed the method's efficiency for each. For the cubic Klein-Gordon equation, Muite showed strong scaling results on Mira for a \(4096^3\) discretization by Brian Leu, Albert Liu, and Parth Sheth. Animations (by Brian Leu, Albert Liu, Parth Sheth, and Michael Quell) of solutions to the Klein-Gordon equation are available online. Another scaling chart for the Klein-Gordon equation showed computation time for 30 time steps as a function of the number of processor cores for about 14 machines. Most runs used 2decomp&FFT, and a few compared 2decomp&FFT to the Fastest Fourier Transform in the East (FFTE). For the chosen problem, FFTE seemed to offer slightly better performance, though this depends on the computer, problem size, and processor count. Muite used a \(512^3\)-point discretization for all of these tests. A runtime estimation model provided a rough explanation of the pattern of results, and the machines were ranked according to time-to-solution.
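The Fourier pseudo-spectral solvers behind these results all rest on one FFT idiom: differentiate a periodic field by transforming, multiplying by \(ik\), and transforming back. Here is a minimal NumPy sketch of that building block (an illustration only, not Muite's 2decomp&FFT-based code):

```python
import numpy as np

def spectral_derivative(u, L=2 * np.pi):
    """d/dx of a periodic grid function via the FFT.

    Transform to Fourier space, multiply each mode by i*k, and
    transform back: the core operation of pseudo-spectral PDE solvers.
    """
    n = len(u)
    # angular wavenumbers for a periodic domain of length L
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    return np.fft.ifft(1j * k * np.fft.fft(u)).real

# d/dx sin(x) = cos(x), exact to machine precision on a periodic grid
x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
du = spectral_derivative(np.sin(x))
```

For smooth periodic solutions this differentiation is spectrally accurate, which is why FFT cost (and hence the all-to-all communication of the parallel transpose) dominates the runtime of such solvers at scale.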

Hari Sundar (University of Utah) emphasized that the Gauss transform is not a renaming of the Gauss-Fourier transform about which Aseeri spoke, but rather a different transform. The FFT reduces the computational cost of the Fourier transform from \(O(N^2)\) to \(O(N \log N)\); Sundar's research aims to similarly reduce the \(O(N^2)\) cost of Gauss-type transforms. He described the proposed algorithm's approach and stages for uniform and non-uniform point distributions on CPUs, and showed improved scalability on Titan (up to 4,096 processors) for different applications, using a translation scheme for uniform distributions and an octree-based scheme for non-uniform distributions.
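The complexity reduction referenced here is easy to see side by side: the direct DFT is an \(N \times N\) matrix-vector product, while the radix-2 Cooley-Tukey recursion halves the problem at each of \(\log_2 N\) levels. A minimal NumPy sketch for power-of-two sizes:

```python
import numpy as np

def dft_naive(x):
    """Direct O(N^2) DFT: N outputs, each an N-term sum."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    m = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(m, m) / n) @ x

def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT, O(N log N), for power-of-two N."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft_recursive(x[0::2])          # DFT of even-indexed samples
    odd = fft_recursive(x[1::2])           # DFT of odd-indexed samples
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * odd
    return np.concatenate([even + tw, even - tw])
```

Fast Gauss-type transforms pursue the same kind of reduction, but through analytic expansions and tree structures rather than the algebraic factorization the FFT exploits.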

Concluding the minisymposium, Truong Vinh Truong Duy (Nissan ARC, Japan) presented industrial applications that require FFT calculations. He shared his experiences with OpenFFT, a parallel three-dimensional FFT package designed to reduce communication. Finally, he showed benchmarking results for OpenFFT, 2DECOMP&FFT, P3DFFT, and FFTE on a Cray XC30 and the K computer, where FFTE showed superior performance at large core counts.