# Next-Generation FFT Algorithms in Theory and Practice: Parallel Implementations and Applications

By Samar Aseeri

At the 2021 SIAM Conference on Computational Science and Engineering, Daisuke Takahashi (University of Tsukuba), Franz Franchetti (Carnegie Mellon University), Samar Aseeri (King Abdullah University of Science & Technology) and Benson Muite (Kichakato Kizito) organized a minisymposium about the fast Fourier transform (FFT). The session enabled the exchange of ideas between researchers who are working on alternative FFT algorithms and FFT implementations, especially in parallel hardware. Interested readers can learn more about the FFT initiative and view slides from this session online.

Takahashi gave the first talk, which displayed performance results regarding the evaluation of parallel three-dimensional FFT automatic tuning. He compared FFTE 7.1 alpha without overlap and FFTE 7.1 alpha with overlap in terms of four divisions, and FFTE 7.1 alpha with automatic tuning and FFTW 3.3.9. The weak and strong scaling graphs indicated that FFTE 7.1 with automatic running is the fastest. A graph of the breakdown of execution time for FFTE 7.1 alpha further explained how the proposed automatic facility for selecting the optimal parameters of the MPI process grid, computation-communication overlap, and block size had collectively reduced the all-to-all communication time. Takahashi’s presentation is available online.

Next, Franchetti spoke about attempts to develop FFTX (a modern version of FFTW that supports accelerators) and its interface SpectralPack. This is joint work with Lawrence Berkeley National Laboratory and falls under the Exascale Computing Project. Franchetti explained the basic ideas of the package, in which FFTX relies on SPIRAL: the FFT library generator and automatic tuning framework that was developed at Carnegie Mellon University. A header file is included in a proper C++ code that calls the generated transform FFTX, the generated transform consists of roughly 2,000 lines of code, and the performance is about 30-50 percent of the machine roofline. The project’s next step involves performance tuning. Franchetti’s presentation concluded with some audience discussion about how to configure code generation approaches to work well with both compilers and a roadmap of relevant programming languages.

Aseeri then spoke about a scheduling policy to improve communication time by 10 percent in parallel FFT. She started by providing a quick background about FFTs before introducing her project and defining its aim and elements of the work — such as the FFTK Library and the Cray XC40 Shaheen II system. Among the experiments conducted in this work is the job placement approach, wherein the Nid-marker tool helps place jobs on specific nodes that have physical all-to-all connections. The tool, which was specifically created for this work, takes the node-id list and visualizes its location in the cluster. Results show a 10 percent improvement in communication time.

Muite’s talk began with a brief summary about FFT and its usefulness in understanding the behavior of the different equations that allow for more delicate investigations. He then introduced the equations in question, which are related to the incompressible case with reduced solar physics. The equations are a useful model and consist of fluid velocity, a magnetic field, convection terms, pressure, and force. More complicated cases require researchers to examine electric and magnetic fields. Muite also provided some additional explanation about the equations and how one can efficiently and accurately solve them with FFT. When responding to an audience question, he noted that the equations represent a reduced set based on certain assumptions (i.e., the equations used for the sun are typically more complicated).

During his talk, Dietrich Foerster (Bordeaux University) argued that one can compute the band gaps of semiconductors (such as those that exist in computers, but also organic ones) with Lars Hedin’s GW approximation [1] in \(O(N^3)\) operations rather than in \(O(N^4)\) operations, where \(N\) is the number of atoms in the unit cell. Researchers can solve a memory bottleneck that appears in the \(O(N^3)\) algorithm by imposing a Nyquist condition. One needs the KAUST-Oslo 6D FFT algorithm [2] to transform correlations between real and Fourier space. Foerster concluded by mentioning that Aseeri’s test runs indicate no bandwidth bottleneck in the KAUST-Oslo algorithm in \(\textrm{d=6}\) dimensions for up to 112 Fourier modes per dimension and up to 4096 cores; this is enough even for organic semiconductors.

Upon the conclusion of the presentations and their respective question-and-answer sessions, Takahashi offered some closing remarks and thanked everyone for their participation.

**References**

[1] Hedin, L. (1965). New method for calculating the one-particle Green’s function with application to the electron-gas problem. *Phys. Rev.*, 139, A796.

[2] Dalcin, L., Mortensen, M., & Keyes, D.E. (2019). Fast parallel multidimensional FFT using advanced MPI. *J. Para. Dist. Comp.*, 128, 137.
Samar Aseeri is a computational scientist at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia.