By Jeffrey Brock, Alden Bumstead, and Björn Sandstede
Brown University’s Data Science Initiative is in its second year of running a TRIPODS institute. TRIPODS—Transdisciplinary Research in Principles of Data Science—is a program established and funded by the National Science Foundation (NSF) to “bring together the mathematics, statistics, and theoretical computer science communities to develop the theoretical foundations of data science through integrated research and training activities.” Brown’s institute, entitled “Foundations of Model Driven Discovery from Massive Data,” is one of 12 TRIPODS institutes across the United States.
The Data Science Initiative serves as a hub for research and education in the foundational methodologies, domain applications, and societal impacts of data science at Brown. A collaborative effort of Brown’s Departments of Biostatistics, Computer Science, and Mathematics, as well as the Division of Applied Mathematics, the initiative was established in 2015 to foster methodological and domain-driven data science research and support educational and training activities in data science that intersect these disciplines. Thus, it was natural for our team—consisting of Jeffrey Brock (Department of Mathematics), Stuart Geman and Björn Sandstede (Division of Applied Mathematics), Joseph Hogan (Department of Biostatistics), and Eli Upfal (Department of Computer Science)—to apply for a TRIPODS grant.
Our institute’s research is organized around three themes: geometric and topological methods for complex data, new tools for causal and model-based inference, and data analysis on massive graphs and networks. A concrete example of our ongoing work is an effort by Geman’s team to improve collective understanding of deep neural networks: This research is based on the belief that one cannot substantially close the gap between human and machine performance with regard to interpretation (as opposed to classification) without architectures that support stronger representations. The group has developed a mechanism for embedding spatial and abstract relationships that bind parts and objects and define context within deep neural networks; currently the resulting dynamics considerably complicate both inference and estimation, although smaller experiments have been encouraging. A second project focuses on the development of algorithms that reveal differential genetic architecture between traits in sizeable genome data sets. The goal of this work is to use large genome-wide association studies that combine medical records with genome data from over 300,000 individuals to detect genes that share mutations across disease groups.
Training and community building have been major focal points of the institute over the past 18 months. Two activities were particularly successful: the first was an exciting weeklong summer bootcamp on “Topology and Machine Learning” in August 2018, a feat that would not have been possible without TRIPODS. This program was hosted at the NSF-funded Institute for Computational and Experimental Research in Mathematics and brought together 36 attendees from 20 institutions, ranging from undergraduate and graduate students to faculty and scientists from national laboratories. The bootcamp used tutorial lectures and hands-on labs to provide an in-depth introduction to topological data analysis, machine learning, and their intersection. At the end of the week, invited speakers presented their research in these areas and connected bootcamp topics to concrete research projects.
The second noteworthy event was a two-day informal networking workshop on “Building Community in the Foundations of Data Science,” which took place that same month. The workshop involved a series of short talks and brainstorming sessions dedicated to strengthening data science activity in the Northeastern U.S. A mix of participants from a range of research universities and liberal arts colleges made the workshop informative, inspiring, and forward-looking. Discussions focused on research, education, and the need for accessible data-fluency curricula. Among the outcomes was a concrete list of suggestions for how TRIPODS could further research and educational activities in the Northeast, namely by developing listings of researchers, their expertise, and methods to encourage collaboration; organizing smaller working groups around specific topics; fostering the co-design of courses by instructors from different organizations; and supporting students and postdoctoral researchers capable of bridging multiple institutions.
We are excited about our three TRIPODS+X projects that connect our TRIPODS institute at Brown with researchers from Duke University, Montana State University, Rutgers University, Smith College, Valparaiso University, and Yale University. One project aims to develop computational tools to identify gene regulatory networks from genome-wide expression data sets, another seeks to build methodological and theoretical bridges between neuroscience and data science, and a third attempts to improve data science education for undergraduates by recognizing student misconceptions about data science concepts and mismatches between core curricular elements and the daily work of early-career practitioners.
More information about the 12 TRIPODS institutes and corresponding events is available at the TRIPODS website.
Jeffrey Brock is a professor of mathematics and the inaugural Dean of Science in the Faculty of Arts and Sciences at Yale University. Alden Bumstead is associate director of the Data Science Initiative at Brown University. Björn Sandstede is a professor of applied mathematics, a Royce Family Professor of Teaching Excellence, and director of the Data Science Initiative at Brown.