July 19, 2012

A Data Deluge in April

April was a busy month for those involved in the mathematical aspects of data science. As regular readers of SIAM News will know, April is Mathematics Awareness Month and the 2012 topic was Mathematics, Statistics, and the Data Deluge (see www.mathaware.org). An opportunity to address some of the more challenging questions in this field, while raising others, came at the 12th SIAM International Conference on Data Mining, held April 26–28 in Anaheim, California. The popularity and timeliness of the topic were reflected in the best attendance at the conference so far: nearly 300 participants, with papers presented by a set of authors who, true to the conference name, had come from Australia, Belgium, China, Germany, Italy, Japan, the Netherlands, Singapore, Switzerland, Turkey, the UK, and the US.

SDM12 was also the first SDM conference after the formation of the SIAM Activity Group on Data Mining and Analytics. Unlike other SIAM conferences, SDM preceded the creation of its SIAG by nearly a dozen years, during which it focused on the mathematical aspects of data mining and filled a gap left by other conferences in the field.

At the heart of SDM12 were four widely ranging keynote talks, on learning systems in healthcare, network science, transfer learning, and information retrieval. Bharat Rao of Siemens Healthcare focused on practical challenges faced in the application of data mining techniques to problems in healthcare. Medical treatment is largely stochastic, he emphasized: Drugs work in some patients and not in others, leading to individualized care. Data mining techniques play a role in decision support, in the aggregation of data from different sources, and in the injection of knowledge via probabilistic inference. A consequence has been not only a need to combine structured and unstructured data, but also challenges resulting from the breakdown of assumptions on algorithms, such as the availability of i.i.d. data, balanced datasets, and ground truth for comparison.

Susan Dumais, from Microsoft Research, provided illuminating insights on improving web retrieval by exploiting the fact that both web content and a user’s interactions with the web change over time. To analyze data obtained by conducting large-scale web crawls, her team came up with metrics for determining when a web page had changed and by how much, and identified patterns that reflected revisitation of a site by a user. It was by exploiting the temporal dynamics pervasive in information systems that they were able to improve retrieval.

Qiang Yang, from Hong Kong University of Science and Technology, discussed ways to benefit from cross-domain transfer learning. He considered the case in which major assumptions in traditional learning—namely, that test and training data are in the same feature space and follow the same distribution—are no longer valid. In such problems, he said, ideas from the “transfer of learning,” a concept from educational psychology theory, can be applied. The set of references Yang listed at the end of his talk clearly indicated that transfer learning lives up to its name by borrowing ideas from many domains.

A fascinating glimpse into network science was provided by Noshir Contractor of Northwestern, who has been investigating what prompts individuals to form teams and what leads to a successful team. He observed that team science is increasingly carried out by researchers at different universities, who tend to produce higher-impact work than comparable co-located teams or solo scientists. His analysis of a dataset of interdisciplinary scientific teams that submitted proposals to NSF resulted in some interesting findings: Researchers from top-tier institutions and those with high scores on the impact-rating H-index, for example, were less likely to collaborate, and those with tenure were more likely to collaborate. Other interesting insights into what makes a successful team are available at the SDM12 website, which includes slides from all the keynote presentations.

A major part of the SDM conference is the presentation of refereed papers, some as talks in parallel sessions, others as two-minute “spotlights” given prior to a poster session on the first day, along with a welcome reception. The topics covered in this year’s papers include pattern mining, time series and sequence analysis, clustering, social media, and graphs, as well as applications from healthcare, climate, and networks. All papers, including those from past SDM conferences, are available at the SDM proceedings website. The best papers from the conference (identified by a committee from the papers that were highly ranked in the review process) will appear in expanded form in a special issue of the Wiley journal Statistical Analysis and Data Mining.

Complementing the paper sessions were several workshops, as well as five tutorials, on distance metric learning, discovering roles and anomalies in graphs, multi-task learning, privacy-preserving medical data sharing, and advice on doing good research and getting it published in top venues. The last of these was especially relevant to the large number of students who attended SDM, many as the first authors of presented papers. Other student activities included a doctoral forum, which gave nearly fifty students the opportunity to receive feedback on their work; among this group were students from minority and underrepresented groups who attended a workshop designed to increase the participation of these groups in data mining. Another student activity was an NSF panel, “The Case for Interdisciplinary Research: Challenges, Pitfalls and Career Advice,” moderated by Srinivasan Parthasarathy of Ohio State University.

Two minisymposia complemented the rest of the program. For one of them, local area chair Yan Liu had invited several local leaders in data mining, including Charles Elkan (UC San Diego), Eamonn Keogh (UC Riverside), Shanghua Teng (University of Southern California), and Padhraic Smyth (UC Irvine), to present their recent work. The other minisymposium was organized jointly by the SIAGs on Uncertainty Quantification and on Data Mining and Analytics, to explore the common threads in the two areas. Presenters included Hadi Meidani (USC), Sonjoy Das (SUNY, Buffalo), Julien Emile-Geay (USC), and Omar Knio (Duke).

The SDM12 sponsors—the journal Data Mining and Knowledge Discovery, Google, IBM Research, NSF, and the SIAM travel award program—provided generous support, in particular for student travel. The organizing and program committee members put together an exciting meeting that resulted in packed sessions, many with standing room only, despite a venue with competing attractions of its own. I hope that this short article will bring SDM and SIAG/DMA to the attention of the broader SIAM community. On behalf of the other SIAG/DMA officers—Charu Aggarwal (IBM Research), Huan Liu (Arizona State), and Michael Mahoney (Stanford)—I invite you to join us early next May in Austin, Texas, for another exciting conference.

Chandrika Kamath is a researcher at Lawrence Livermore National Laboratory, where she is involved in the analysis of data from scientific simulations, observations, and experiments. She chairs both the SDM Steering Committee and SIAG/DMA.