By Gail Corbett
Data science is a hot field. New institutes have been created at Berkeley, Columbia, NYU, Stanford, and elsewhere, with funding from the Sloan Foundation, the Moore Foundation, and the Simons Foundation; it’s almost impossible to keep track of the educational programs emerging in the field. Quite naturally, many of the new institutes and degree programs place a heavy emphasis on computer science (machine learning, say) or statistics.
In mid-May, hoping to learn about developments in the field and, in particular, the role of applied mathematics in it, SIAM News visited Leslie Greengard, director of the not yet one-year-old Simons Center for Data Analysis in New York City’s Flatiron District.
With his background and interests, Greengard is at home in the worlds and, notably, the languages of both the mathematical and the life sciences. In July, in a talk at the Simons Foundation, which has extramural programs in mathematics and physical science, the life sciences, and autism research, he touched on some of the exciting research avenues then on his mind.
On September 1, he became the founding director of the Simons Center for Data Analysis. Officially, he now spends two thirds of his time at SCDA, one third at Courant.
Clearly, Greengard has succeeded in his quest for something new, beginning with one of his first concrete tasks: deciding on the form a functioning internal research lab would take. Modeled after the “old Bell Labs,” SCDA aims to house between 30 and 40 researchers, with the possibility of year-long visits by senior people.
Like Bell Labs, the Simons setting offers almost constant exposure to good problems, in part via visiting investigators from the Simons extramural research programs. Although SCDA researchers will be free to pursue their own interests, it makes sense for staff scientists to connect their work closely with experimental efforts in neuroscience, genomics, and a wide variety of other fields supported by the Simons Foundation.
The new center, Greengard emphasizes, “is concerned with data science that is closely connected to an underlying scientific model.” In choosing the right problems to work on and in assembling a critical mass of researchers, this is the principle that guides his thinking.
SCDA’s 30 to 40 researchers, he says, will make up five groups: two devoted to infrastructure (mathematics and computer science, and software development) and the three others to the somewhat overlapping areas of genomics, systems biology, and neuroscience. To date, he has made three senior hires: Ian Fisk (most recently computing coordinator, CMS/CERN), deputy director for computing; Nick Carriero (most recently senior research scientist in computer science and co-director, W.M. Keck biotech lab high performance computing resource, Yale School of Medicine), group leader for software development; and Dmitri Chklovskii (most recently leader of a group at Janelia Farm Research Campus), group leader for neuroscience.
Greengard currently spends a lot of his time thinking about problems and research areas that promise to meet his, and Simons’s, expectations for SCDA. Beyond the usual academic motivation of furthering science, SCDA’s mission includes the development of tools that are likely to find wide use. The center “will be partly a software shop,” Greengard says. Internally funded by the Simons Foundation, SCDA researchers will be collaborating with colleagues in both experimental and theoretical disciplines across academic, government, and industrial research labs. Those connections will be built in part around the creation of new algorithms and in part around the development of new software tools and frameworks.
Always animated, Greengard becomes even more so in discussing promising problems for SCDA to take on. While active projects in genomics and bioinformatics are under way at the foundation, most of those he describes for SIAM News touch on problems that arise in neuroscience and structural biology, where new experimental techniques are yielding vast streams of data. He mentions the much-needed development of standards for processes that currently lack them.
A case in point is the analysis of spikes produced on insertion of an electrode, or more interestingly a multi-electrode array, in the brain. The first task in analyzing the data is to determine which of an unknown number of neurons is giving rise to which spike. This process of disambiguation (“spike sorting”) is typically based on the subtly different shapes corresponding to different neurons. At present, Greengard points out, each lab processes data in its own way, generally aided by a variety of academic and commercial software packages. The raw data, however, tends not to be publicly available. What’s needed, and what he considers a worthy project for SCDA, is a platform where people make their data available (for reproducibility) and where scientists can easily develop and test new algorithms. Greengard hopes that the same platform will be useful for EEG and MEG data analysis as well—despite differences in the specific computational tasks, much of the software and data-handling infrastructure will be the same.
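To make the idea of sorting spikes by waveform shape concrete, here is a minimal sketch in Python. It is not SCDA's method or any lab's pipeline. The two waveform templates, the noise level, and the helper names (`make_synthetic_spikes`, `pca_features`, `kmeans`) are all assumptions made for illustration, and the number of neurons is fixed at two, whereas in real recordings it is unknown. The sketch reduces each synthetic spike to two principal components and groups the results with a simple k-means loop.

```python
import numpy as np

def make_synthetic_spikes(n_per_neuron=200, n_samples=40, noise=0.1, seed=0):
    """Simulate spikes from two hypothetical neurons with different shapes."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_samples)
    # Assumed templates: one narrow deflection, one wider with a rebound.
    narrow = -np.exp(-((t - 0.3) / 0.05) ** 2)
    wide = (-0.8 * np.exp(-((t - 0.4) / 0.12) ** 2)
            + 0.3 * np.exp(-((t - 0.7) / 0.10) ** 2))
    templates = np.stack([narrow, wide])
    labels = np.repeat([0, 1], n_per_neuron)
    noise_term = noise * rng.standard_normal((labels.size, n_samples))
    return templates[labels] + noise_term, labels

def pca_features(spikes, n_components=2):
    """Project mean-centered waveforms onto their leading principal axes."""
    centered = spikes - spikes.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iteration with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

spikes, true_labels = make_synthetic_spikes()
assign, _ = kmeans(pca_features(spikes), k=2)
# Cluster numbering is arbitrary, so score both labelings and keep the better.
agreement = max(np.mean(assign == true_labels), np.mean(assign != true_labels))
print(f"fraction of spikes assigned consistently: {agreement:.3f}")
```

Real recordings are far harder than this toy case: spikes overlap, waveforms drift, and the number of neurons must itself be estimated. That difficulty is part of why a shared platform for testing algorithms on public data would be useful.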
He also mentions cryo-electron microscopy as a biologically important and mathematically compelling area of research. “Even though the field is now several decades old, new microscopes and new electron detectors have made this an exciting research area.” Cryo-EM (also called single-particle EM) is a way to determine the three-dimensional structure of proteins without the need for crystallization, making it a much more accessible and high-throughput modality. Beginning with noisy projection data from frozen configurations of the proteins in unknown orientations, the mathematical problem is to determine the corresponding atomic-resolution structure. In the course of the experiment, the electron beam distorts not only the proteins, but the entire ice sheet in which they’re embedded.
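One standard mathematical fact behind reconstruction from projections, not stated in the article but widely used in tomography, is the Fourier slice theorem: the 2D Fourier transform of a projection equals a central slice of the object's 3D Fourier transform. The sketch below checks this numerically with NumPy on a random stand-in volume. The volume, the noise level, and the helper names `project_along_z` and `central_slice` are assumptions for illustration only. The beam direction is fixed along one axis, whereas in a real experiment the orientations are unknown.

```python
import numpy as np

def project_along_z(volume):
    """Integrate a 3D density along the (assumed) beam direction, axis 2."""
    return volume.sum(axis=2)

def central_slice(volume):
    """Return the k_z = 0 plane of the volume's 3D discrete Fourier transform."""
    return np.fft.fftn(volume)[:, :, 0]

rng = np.random.default_rng(0)
n = 16
volume = rng.random((n, n, n))  # hypothetical stand-in for a protein density

projection = project_along_z(volume)

# Discrete Fourier slice theorem: FFT2 of the projection is the central slice.
slice_matches = np.allclose(np.fft.fft2(projection), central_slice(volume))
print("projection FFT equals central slice:", slice_matches)

# Measured images are noisy; additive Gaussian noise is a simplifying assumption.
noisy_projection = projection + 0.5 * rng.standard_normal(projection.shape)
```

Each image from a differently oriented particle therefore supplies one plane through the origin of the 3D transform. Roughly speaking, recovering the structure means working out how those planes fit together when their orientations are not known in advance and the images are noisy.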
Greengard views the development of next-generation tools for cryo-EM as a “great poster child for computational science.” Success will involve “physical modeling, Fourier analysis, numerical analysis, and data science all rolled into one.” He has been discussing the problem with Amit Singer of Princeton, one of a group of mathematicians who have been working in the area for some time (and an invited speaker on the subject at the 2012 SIAM Conference on Imaging Science). “It’s a great match of modern computational science with a biologically important problem,” Greengard says, “and a great example of the kinds of problems we intend to focus on.”
SCDA will also work with the Simons Collaboration on the Global Brain, an initiative launched by the Simons Foundation that involves some of the key researchers in the field, including David Tank of Princeton, Larry Abbott of Columbia, William Newsome of Stanford, Anthony Movshon of NYU, and Gerald Fischbach, chief scientist and fellow of the foundation. With ties to the major US BRAIN initiative (Brain Research through Advancing Innovative Neurotechnologies), the collaboration exemplifies the major problems in data science, all requiring new methods and algorithms, to be tackled by SCDA scientists.
Greengard will give this year’s John von Neumann Lecture (“Fast, Accurate Tools for Physical Modeling in Complex Geometry”) in July at the SIAM Annual Meeting in Chicago. He suggested an alternative, snappier title during the SIAM News visit: “Anatomy of a Problem.”
Gail Corbett is the managing editor of SIAM News.