The interdisciplinary field of data science is growing rapidly, altering the way researchers process and react to data. As a result, a rising number of universities are incorporating data science courses and programs into their applied and computational mathematics curricula. At the 2017 SIAM Annual Meeting, which took place in Pittsburgh, Pa., this July, three panelists spoke about the various data science programs at their respective institutions, and commented on opportunities accompanying the versatile field’s direction.
“Data science is the sexiest job of the 21st century,” Randy Paffenroth of Worcester Polytechnic Institute (WPI) joked, citing a 2012 article in the Harvard Business Review. “What’s really important is the idea of communication. How do you take the data analysis you do and package it in a way that is faithful to the math but understandable to someone who’s not a mathematician?”
Paffenroth attributes the growth of data science to the copious amounts of information currently available. YouTube grows by 400 hours of video per minute (24,000 minutes of footage for every minute of real time), the equivalent of 24,000 people uploading their entire lives in real-time streams. Google translates 69,500,000 words per minute, and 216,302 photos are shared on Facebook every 60 seconds. “Every time we go on the web, we’re participating in a data science experiment,” Paffenroth said. “Phones generate data all the time.”
Figure 1. Drew Conway’s data science Venn diagram. Image courtesy of Stack Overflow.
He explained that the field draws from three different domains—math and statistics, hacking skills, and substantive expertise—and presented Drew Conway’s data science Venn diagram to illustrate the overlaps (see Figure 1). The so-called “danger zone” occurs at the juncture of hacking and substantive expertise, when one doesn’t have sufficient mathematical experience.
Paffenroth spent a decade in industry before arriving at WPI in 2014, when the university’s data science program was just getting started. What began at the master’s level that fall has grown into a comprehensive research program that offers a Ph.D. The program’s structure is based on Conway’s diagram: a collaboration between the Department of Mathematical Sciences, the Department of Computer Science, and the Foisie Business School. Its status as an independent program, rather than a subset of a particular department or school, equally represents the three independent entities. “That’s one thing that we think has helped the program grow,” Paffenroth said. “It’s been pretty faithful to all of the pieces of it.” Faculty members are now working backwards to develop the undergraduate curriculum; the first group of undergraduate minors graduated in the spring of 2017.
Every data science student at WPI must take Integrative Data Science, a course created specifically for the program. “From this course, the students know all the other pieces of data science and decide what other courses they might want to take,” Paffenroth said. “Students aren’t going to be experts in all the things, but they’re at least aware of them all.” They then take courses from each of the three contributing departments/schools and complete a capstone project. Those who intend to pursue a Ph.D. are encouraged to write a master’s-level thesis, while an industrial-based group qualifying project is sufficient for students seeking immediate employment. All projects thus far have been sponsored by companies.
Kristin Bennett of Rensselaer Polytechnic Institute (RPI) is director and founder of the Data Interdisciplinary Challenges Intelligent Technology Exploration Laboratory (Data INCITE Lab) at RPI. Since its establishment in 2014, RPI’s data science program has grown to receive industrial support and now partners with industry and university researchers. “The program engenders engaged students from diverse majors with data analytics skills who are highly recruited for internships, co-ops, and jobs,” Bennett said.
Data science faculty begin recruiting students in the second semester of freshman year. No prerequisites are required, though some calculus is expected. “We actually do real-life, impactful research using undergraduates,” Bennett said. “The program is quite innovative in that we go right after the students when they are just babies.” She spoke highly of Introduction to Data Mathematics, a course that teaches real-life data analytics, including classification and clustering, while incorporating smaller consulting projects. “The students who take this class want to do more,” she said. “They want to take statistics. They want to do computing. It’s pretty much getting to that storytelling aspect that is so compelling to students.” About half of the participating freshmen conduct paid research the following summer, which allows them to start accumulating a work portfolio at a young age. “Getting engaged in research is what really gets them excited,” Bennett said. “To do this as a freshman is very powerful.”
Bennett runs the summer research program like she runs her research lab, and undergraduates have tackled a wide variety of pertinent topics, including tuberculosis, circadian rhythms, computer chip production, Zika virus, and wind turbines. They use real-world data sets and observations to model these topics mathematically. “The big thing is learning how to translate an abstract model into a data analysis task,” she said. Goals of the summer session include working in an interdisciplinary team; posing powerful questions; communicating effectively; and developing engineering and infrastructure skills to implement, simulate, and analyze math models. Additionally, summer research is narrower and deeper than mathematics classes. “We’re putting you right into high-dimensional mathematics,” Bennett said. “We put this stuff to work. It’s much more focused.”
She assured faculty members that they don’t have to make an entire data science major to incorporate data science into the curriculum. “You can make and develop it in a fashion that is organic and grows from below,” she said. Thus far, the program has seen success in encouraging students to pursue future careers in data analytics, as early experience is key in recruiting to the field and getting involved. “A big problem with our current math sequence is that it just takes too long to get where you want to go,” Bennett said. “As a community we should be thinking about if we should be doing things in a different order.”
Louis Rossi of the University of Delaware (UD) is a member of the school’s data science task force, which aims to eventually establish a data science institute. Until then, UD continues to integrate the field into its mathematics curriculum. In 2016, the Department of Mathematics offered an online, textbook-free graduate course in data mining that filled up instantly. The department also ran an undergraduate data mining class in the fall of 2016, and introduced an entry-level undergraduate applied algebraic topology course appropriate for freshmen and sophomores this fall. “Algebraic topology has something very fundamental to say about the shape of data,” Rossi said. That course filled up too.
Computer science classes on data structures and machine learning are naturally available to students, as are courses on regression analysis and statistics. Rossi added that the undergraduate capstone course is often data-heavy; one year, students tackled a machine learning whale identification problem using satellite imagery. “Whenever these students interview, the interviews always go to this class and what they did,” he said. “It generates a lot of interest.”
However, without an established major or minor in data science, structuring a program with multiple units that support students’ diverse passions is challenging; Rossi acknowledged that perhaps the solution is to take such a program out of a department and run it as an interdisciplinary project, as Paffenroth suggested.
Obtaining data with which to work is undoubtedly important. UD’s Department of Mathematics has a wet lab that allows students to conduct real experiments and utilize the resulting data. Rossi also recommended Kaggle, Google DeepMind, and PITCHf/x (which reports data from all live pitches at baseball games) as sites that provide workable data. Learning how to wrangle data is imperative because, more often than not, data is complicated and jumbled. Integrative Data Science at WPI introduces students to this notion with a project about absorbing a Twitter feed. “The idea is not to learn anything about the data, but to learn how ugly data can be,” Paffenroth said.
Universities themselves offer vast access to raw data as well. Bennett suggested that students reach out to fellow students, colleagues, or even admissions office employees. “I bet if any of you work in a university or college where people do research, those people have data coming out of their eyeballs,” she said. “You get the live data, it’s exciting, you know who generated it, and it’s messy.”
Paffenroth briefly discussed the Preparation for Industrial Careers in Mathematical Sciences (PIC Math) program, which both engages students in industry-based research problems and trains faculty in addressing data science in classroom settings. Rossi praised industry workshops, which occur frequently and are reasonably priced. “You’ll be shoulder to shoulder with an expert,” he said. “There’s going to be people there who are hired guns, really knowledgeable. And then there will be people there who want to learn and contribute.”
All three panelists encouraged faculty to be flexible and to experiment with personalized coaching over traditional lecturing. “You always have to work with your students,” Rossi said. “But don’t confuse that with the idea that they need to learn math the way you learned math. They can have a rich mathematical experience in a different order or with different topics.” Paffenroth echoed Rossi’s sentiment. “It’s important not to try to put everything in the ‘math’ category,” he said. “You need to figure out what fits best where.”
Lina Sorg is the associate editor of SIAM News.