The Difficulties of Addressing Interdisciplinary Challenges at the Foundations of Data Science (Part I of II)

By Michael W. Mahoney

The Transdisciplinary Research in Principles of Data Science (TRIPODS) program is an effort funded by the National Science Foundation (NSF) to establish the foundations of data science. It aims to unite the statistics, mathematics, and theoretical computer science research communities (three areas central to the foundations of data) to build the theoretical foundations that will enable continued data-driven discovery and breakthroughs in the natural and social sciences, engineering, and beyond.

The size and scope of its interdisciplinarity make TRIPODS an unusual endeavor. The program’s first phase consists of $17.7 million in funding for 12 Phase I institutes. After an initial three-year effort (currently in progress), a longer second phase will consist of a smaller number of larger, full-scale Phase II institutes. Additionally, a related TRIPODS+X program is designed to expand the scope of the original TRIPODS Phase I projects to involve interactions with researchers from domain “X,” where “X” is astronomy, genetics, materials science, neuroscience, the social sciences, or any one of a wide range of other data-driven disciplines. Spurred by the success of TRIPODS and the excitement of using it as a model to fund interdisciplinary research more generally, the NSF recently announced the creation of a second parallel TRIPODS Phase I–Phase II program; this new cohort also includes electrical engineering.

I am the director and principal investigator (PI) of the new University of California, Berkeley Foundations of Data Analysis (FODA) Institute, one of the 12 original TRIPODS Phase I institutes. I work with my co-PIs—Peter Bartlett, Michael Jordan, Fernando Perez, Bin Yu, and Uros Seljak (with TRIPODS+X)—to make transformational advances in the interdisciplinary foundations of data science, incorporating both teaching and research. We collaborate with a range of campus partners—such as the Berkeley Institute for Data Science (BIDS), Simons Institute for the Theory of Computing, Real-time Intelligent Secure Explainable (RISE) Lab, and Lawrence Berkeley National Laboratory—that address other complementary aspects of data science. Beyond simply proving theorems, we are interested in how the foundations interact synergistically with increasingly data-driven domain sciences. For me, these efforts follow a long line of previous work, including the Workshop on Algorithms for Modern Massive Data Sets (MMDS) meetings [1, 3] and the 2016 Park City Mathematics Institute summer school on “The Mathematics of Data” [2].

The TRIPODS program is both timely and important. Every field needs foundations to understand when and why methods work as they do, but the three areas that TRIPODS identifies as being closest to the foundations of data science have very different foundational principles. Much data science education and training is currently limited to teaching tools (Python routines, etc.), rather than an inherent understanding of foundational principles. While foundational research may result in plainer graphics and be less immediately applicable than work in other more applied areas, failing to invest in foundational questions will lead to fields that are less intellectually rich and have hollower shells. In such scenarios, deeper connections between superficially different methods applied in very diverse areas will not be recognized, understood, or exploited.

Furthermore, the TRIPODS program is relevant and impressive as a model for funding cross-cutting research more generally. The interdisciplinary challenges in orchestrating fruitful interactions between foundational computer scientists, statisticians, and applied mathematicians will be mirrored—probably much more so—when considering social, behavioral, and economic challenges associated with large-scale computing platforms; ethical and responsible uses of data; machine learning for materials science or biomedical science; or any more outward-facing complications important to data science.

In both this article and part II, which will appear in the October issue of SIAM News, I discuss my experiences with several of these challenges.

Community Background

Beyond facing difficult technical questions about the foundations of data, building cross-cutting platforms between different disciplines and conducting truly interdisciplinary research are very arduous tasks. Funding mechanisms, hiring processes, and contrasting disciplinary cultures all conspire against such efforts. If the goal is to bridge the gap between disparate research communities, then understanding the communities’ backgrounds helps us identify the following three key classes of challenges that arise.

Structural Challenge

TRIPODS stems in part from a phenomenon, a sort of “structural chasm,” that can cause interdisciplinary work to “fall between the cracks.” Scientists conducting research that cuts across traditional disciplinary boundaries are familiar with its effects. If one has a proposal that sits squarely between the NSF’s Division of Mathematical Sciences (DMS), which funds mathematics and statistics research, and its Directorate for Computer and Information Science and Engineering (CISE), which funds computer science research, then one must decide where to submit the proposal. Upon receiving it at the DMS, reviewers might decide that while the proposal contains great ideas and may have high impact, it isn’t quite within the scope of the division and could be better suited for CISE. Reviewers at CISE may react the same way, deeming the proposal more appropriate for the DMS. On a related note, although universities are great at putting together interdisciplinary teams, they are much less adept at hiring interdisciplinary people. This sort of structural challenge, whereby newly forming areas do not conform well to existing administrative silos, is perhaps the most obvious type of issue that arises in interdisciplinary efforts.

Justification Challenge

Motivated by this predicament and prompted by the NSF, Petros Drineas (Purdue University) and Xiaoming Huo (Georgia Institute of Technology) organized an exploratory workshop on Theoretical Foundations of Data Science (TFoDS): Algorithmic, Mathematical, and Statistical in April 2016. The workshop’s objective was “to identify important research challenges that strengthen and broaden the mathematical, statistical, and algorithmic foundations of data science.” Participants discussed potential opportunities for collaboration between the relevant communities, and investigated workforce challenges and infrastructure development. Ultimately, the meeting yielded a report that suggested the creation of a “center or institute, funded by an agency such as the NSF…that emphasizes the foundational aspects of data science.”

Much discussion during TFoDS addressed a justification challenge, whereby foundations should inform, and be informed by, very practical problems (i.e., they should not amount to pure theory divorced from practice) while not having to justify their existence in terms of immediate usefulness in particular domains.

Cultural Challenge

A lot of TFoDS conversation also focused on what form the aforementioned center or interdisciplinary institute could take, or whether an institute is even the best mechanism to propel the area forward. However, there was little consensus. Unresolved questions include the following.

Should such an institute run short meetings or long programs?
What are attendance expectations?
Should the institute be an “incubator” for projects that cut across the three relevant areas?
Would these projects be funded by “internal” proposal solicitations?
Should they be virtual or physically located in one place?
How would one encourage long-term interactions?
Should projects attempt to promote “risky” research, particularly among younger researchers?
How do these desires relate to the timescales for publication, recognition, and reward?

Far from being administrative issues, these questions (and the lack of straightforward answers) highlight a cultural challenge. Structuring interactions between different areas—given their very distinct styles and modes of interaction, in addition to the conflicting power dynamics within and between them—is an extremely complicated task.

Part II of this article, to be published in the October issue of SIAM News, will describe the broader NSF context as well as lessons for interdisciplinary and antedisciplinary balance, both for foundations of data and in a more general framework.

References
[1] Golub, G.H., Mahoney, M.W., Drineas, P., & Lim, L.-H. (2006, October). Bridging the gap between numerical linear algebra, theoretical computer science, and data applications. SIAM News, 39(8).
[2] Mahoney, M.W., Duchi, J.C., & Gilbert, A.C. (Eds.) (2018). The Mathematics of Data. IAS/Park City Mathematics Series (Vol. 25). Providence, RI: American Mathematical Society.
[3] Mahoney, M.W., Lim, L.-H., & Carlsson, G.E. (2008). Algorithmic and statistical challenges in modern large-scale data analysis are the focus of MMDS 2008. SIGKDD Explor., 10(2), 57-60.

Michael W. Mahoney is affiliated with the International Computer Science Institute and the Department of Statistics at the University of California, Berkeley.