October 01, 2019

The Difficulties of Addressing Interdisciplinary Challenges at the Foundations of Data Science (Part II of II)

Part I of this article, which was published in the September issue of SIAM News, described the National Science Foundation’s (NSF) Transdisciplinary Research in Principles of Data Science (TRIPODS) program, as well as the structural, justification, and cultural challenges that arise from this effort.

While perhaps not immediately obvious, many seemingly innocuous decisions can introduce a strong selection bias toward or against certain areas in interdisciplinary efforts. The following actions all possess the potential for such selection bias: encouraging interdisciplinary interactions without understanding conflicting recognition and reward requirements; expecting publication immediately after one has vocalized an idea versus after one has clarified all of the details; deciding on a complex interdisciplinary effort’s final form quickly versus deliberately; requiring attendance throughout most or all of a long program; and listing authors alphabetically, based on contribution, or according to some other rule. Ignoring these issues, albeit often inadvertently, invariably undermines interdisciplinary efforts. Encouraging the research community to address these concerns in ways that draw strength from the diversity of researchers, and do not undermine their cultural sensibilities, continues to be a primary challenge.

Broader NSF Context

Discussions on this topic at the 2016 workshop on Theoretical Foundations of Data Science (TFoDS): Algorithmic, Mathematical, and Statistical occurred within a broader context of conversations at the NSF regarding the organization’s long-term research agenda. That same year, the NSF proposed its 10 Big Ideas, a set of “long-term research and process ideas that identify areas for future investment at the frontiers of science and engineering.” One of these ideas, “Harnessing the Data Revolution” (HDR), focuses on both fundamental research in data science and engineering and the development of a 21st-century, data-capable workforce to help researchers exploit the Big Data revolution. This was part of the NSF’s effort toward “Growing Convergence Research,” another big idea that seeks to integrate multiple disciplines to advance scientific discovery and innovation. The timing was right, and the NSF’s first major investment toward HDR was the TRIPODS program.

With TRIPODS, and following the suggestions of the TFoDS workshop report, the NSF made a call for institutes on the foundations of data science. However, it was not overly prescriptive as to what that means. Instead, it split the program into two phases, Phase I and Phase II (described in Part I of this article), so that the community could determine what it desired from an institute that spanned three rather culturally different areas. This two-phase structure also permits a ramp-up period before full-scale institute activities (such as research, education, workforce development, visitor hosting, and direction setting) begin. Phase I principal investigators (PIs) are addressing additional challenges, such as how to design institutes that do not grate against the standards of one or more of the communities, and instead yield a true synergy of all three disciplines’ best capabilities.

Each of the 12 Phase I institutes approaches these challenges and its individual mission in somewhat different ways, ultimately acting as a type of “experimental trial” for struggles and successes. One of TRIPODS’ more unusual aspects is its monthly PI call and annual PI meeting. These allow for frank discussion of what is and is not working at different Phase I institutes and within the general community. Indeed, one of TRIPODS’ most valuable aspects is its facilitation of camaraderie among leading researchers with diverse backgrounds who are interested in similar challenges.

Interdisciplinary and “Antedisciplinary” Balance

TRIPODS’ emphasis on interdisciplinary foundations and its attention to the cultural challenges associated with cross-cutting research are discussion points at the PI calls and meetings.

The three core areas of TRIPODS—statistics, mathematics, and theoretical computer science—focus on questions that are relevant to the theoretical foundations of data science, but they do so in very different and sometimes incomparable ways. Each Phase I institute is adopting its own approach. The Foundations of Data Analysis Institute at the University of California, Berkeley will initially concentrate on four deep theoretical challenges: the possibility of a general complexity theory of inference in the context of optimization; the power of stability as a computational-inferential principle; the value of randomness as a statistical and algorithmic resource in data-driven computational mathematics; and the principled combination of science-based and data-driven models. These foundational challenges straddle existing cultures of disciplinary research. Each is situated squarely at the interface of theoretical computer science, theoretical statistics, and applied mathematics; and each is directly relevant to a wide range of very practical data science problems.

From the perspective of addressing cultural challenges associated with cross-cutting research, TRIPODS and the accompanying “TRIPODS model” offer the very real possibility of a “test case” on engineering funding for what Sean Eddy of Harvard University calls “antedisciplinary science” (where “ante” means “before,” not “anti” as in “against,” although the two are clearly related). He defines this as “the science that precedes the organization of new disciplines, the Wild West frontier stage that comes before the law arrives” [1]. The way in which data science will evolve (e.g., whether the field will look more like present-day computational science or computer science) remains to be seen. Nevertheless, Eddy’s discussion, which relates to the National Institutes of Health’s funding of computational biology, is relevant to the discussion of cross-cutting research in the foundations of data and beyond. “Focusing on interdisciplinary teams instead of interdisciplinary people reinforces standard disciplinary boundaries rather than breaking them down,” he writes. “An interdisciplinary team is a committee in which members identify themselves as experts in areas besides the actual scientific problem at hand, and abdicate responsibility for the majority of the work because it’s not in their field.”

Many TRIPODS PIs are wrestling with these challenges, both in their own research agendas and in their efforts to create Phase II TRIPODS institutes that are broadly useful to the community. Are the specifics of the current Phase I–Phase II structure the best way to coalesce community understanding of what should comprise a longer-term institute (either for foundations of data or more general cross-cutting challenges)? How can such institutes encourage interdisciplinary people as well as interdisciplinary teams? How can they highlight the broad usefulness of interdisciplinary foundational work without diluting its foundational content? How can we ensure that current design decisions do not deter substantial participation by one of the disciplines of interest? Of course, more obvious issues, such as ensuring that the whole is more than the sum of its parts and exploring novel ways to extend NSF funding, also exist.

While many challenging questions remain, participants in the current TRIPODS program are diligently tackling them to establish the foundations of data science.

References
[1] Eddy, S.R. (2005). “Antedisciplinary” science. PLoS Comput. Biol., 1(1), e6.

Michael W. Mahoney is affiliated with the International Computer Science Institute and the Department of Statistics at the University of California, Berkeley.