By Lina Sorg
Cynthia Dwork of Microsoft Research discusses differential privacy in the analysis of large quantities of data. Staff photo.
In the current age of big data, analyzing and interpreting data correctly is quickly becoming a major challenge. Data analysis not only increases the risk of spurious scientific discovery but can also compromise privacy. In her plenary talk, “Privacy and Validity in the Land of Plenty,” at the SIAM Annual Meeting in Boston, MA, Cynthia Dwork of Microsoft Research discussed the challenges and methods of preserving privacy in data analysis. She also gave an overview of mathematical methods for ensuring that conclusions drawn from big data sets are as accurate as possible.
Dwork first became interested in privacy in the modern age through the work of Helen Nissenbaum. She began her talk with a brief discussion of the general assumption that privacy-preserving data analysis means that we shouldn’t learn anything new about the subject in question. “The problem with this,” Dwork said, “is that if we’re not going to learn anything new about people, what is the point of the data set?” However, there is a workaround for this problem: it is not a privacy compromise if analysts would have learned the same thing had the subject been replaced by another random member of the population. This is the idea behind differential privacy, a general solution concept that is robust to the networked world and reveals fundamental truths about computing stability. “We’re learning about people, but not specifically learning about the individuals in the data set,” Dwork said.
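To make the idea concrete, here is a minimal Python sketch of one standard way to answer a counting question with differential privacy, the Laplace mechanism. It is not taken from Dwork's talk; the function name `private_count`, the parameter `epsilon` (the privacy budget), and the sample data are illustrative assumptions. Replacing any one person in the data changes a count by at most 1, so adding Laplace noise with scale 1/epsilon makes the answer nearly as likely with or without that person.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of records matching predicate (illustrative Laplace mechanism).

    Assumes a counting query: replacing one record changes the true count
    by at most 1, so Laplace noise with scale 1/epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent Exponential(rate=epsilon) draws
    # follows a Laplace distribution with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical data: how many people in the set are 40 or older?
ages = [34, 51, 29, 62, 45]
print(private_count(ages, lambda age: age >= 40, epsilon=0.5))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one gives more accurate answers. This sketch ignores practical concerns such as floating-point subtleties and tracking the privacy budget across many queries.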
Dwork also examined the problem of statistical validity in adaptive scenarios, where new questions and new studies depend on the results of previous ones. There is a disconnect between theoretical results and data analysis practice: in practice, data is shared and reused, and new hypotheses and analyses are generated on the basis of discoveries and conclusions from previous analyses. Dwork described work on guaranteeing the validity of statistical inference in adaptive data analysis, since most data sets are representative of populations as a whole; the word “plenty” in her talk’s title refers to one giant, collective data set used by all researchers. “Science is by nature an adaptive process,” she said. “Everyone is studying the same data set and they all influence each other. If in the process of adaptive exploration the analyst finds a query for which the data set is not representative, then she must have learned something significant about the data.”