“Big Data” was a pervasive theme at the 2013 SIAM International Conference on Data Mining, held in Austin, Texas, May 2–4.* Along with discussions of algorithms, machine learning techniques, and statistical methods and their applications, the program included a panel discussion, held the evening of May 3, on the implications of Big Data for the practice of data science.
The panel session, “On Being a Successful Data Scientist in the Big Data Era,” was moderated by Srinivasan Parthasarathy of Ohio State University; the panelists provided perspectives from academia (Chris Jermaine, Rice; Raymond Mooney, University of Texas–Austin; and Peter Szolovits, MIT); government (Chandrika Kamath, Lawrence Livermore National Laboratory); and industry (L.V. Subramaniam, IBM Research–India). The panelists addressed three general questions: What has been the impact of the widespread publicity surrounding Big Data and its potential applications? What research challenges does Big Data present? How can we prepare students to be successful data scientists?
The need for managing “4V data” is widespread. Image courtesy of L.V. Subramaniam, IBM Research–India.
Just one example of the recent flood of publicity on Big Data was a 16-page special section devoted to the topic in The New York Times on June 20. The lead article summarizes the promise of Big Data for Times readers as follows:
“The story is the same in one field after another, in science, politics, crime prevention, public health, sports and industries as varied as energy and advertising. All are being transformed by data-driven discovery and decision-making. . . .
“Big Data is the shorthand label for the phenomenon, which embraces technology, decision-making and public policy. Supplying the technology is a fast-growing market, increasing at more than 30 percent a year and likely to reach $24 billion by 2016, according to a forecast by IDC, a research firm. All the major technology companies, and a host of start-ups, are aggressively pursuing the business.”
The panelists noted positive aspects of the extensive publicity: It increases the visibility of the field, expands the set of applications, and draws talent into the field. At the same time, they expressed reservations, citing increasing concerns about privacy and a possible backlash if exaggerated predictions are not met; they also mentioned ways in which the challenges posed by Big Data have been distorted—in particular, the emphasis placed on technology at the expense of data analysis.
Many of the research challenges discussed stem from what one panelist termed “4V data”: data characterized by volume, velocity, variety, and veracity. Velocity is a consideration because some or all of the data may not be static, as in the case of streaming data. Variety refers to the different formats or modes in which data can appear. Veracity can be thought of as the inverse of uncertainty (1/uncertainty), which means that analysis involves reasoning under uncertainty. Another challenge arises from the variety of applications. As one panelist put it, a data scientist has to have a 360-degree view, as customers do not know what to do with Big Data. Finally, various instruments (surveys, experiments, simulations) need to be designed so that the results of the analysis can be compared to real-world data.
The panel, having identified these challenges, turned to the question of designing training programs for data scientists. The panelists agreed that some elements of training are straightforward, including courses in algorithms, data mining, machine learning, programming, databases, and statistics. Other elements are somewhat more difficult to achieve. For example, students should have experience working with real data, which implies, in turn, that they need to understand the domain from which the data is derived. Finally, data scientists need to master tools such as visualization, and they must be able to communicate their results clearly, both orally and in writing.
*Links to the program, abstracts, and online proceedings can be found here.