Exploring Privacy and Validity in the Land of Plenty

Cynthia Dwork of Microsoft Research discusses differential privacy in the analysis of large quantities of data. Staff photo.

In the current age of big data, analyzing and interpreting data correctly is quickly becoming a huge challenge. Data analysis not only increases the risk of spurious scientific discovery but can also compromise privacy. In her plenary talk, “Privacy and Validity in the Land of Plenty,” at the SIAM Annual Meeting in Boston, MA, Cynthia Dwork of Microsoft Research discussed the challenges and methods of preserving privacy in data analysis. She also surveyed mathematical methods for ensuring that conclusions drawn from big data sets are as accurate as possible.

Dwork first became interested in privacy in the modern age through the work of Helen Nissenbaum. She began her talk with a brief discussion of the general assumption that privacy-preserving data analysis means that we shouldn’t learn anything new about the subject in question. “The problem with this,” Dwork said, “is that if we’re not going to learn anything new about people, what is the point of the data set?” However, there is a workaround for this problem: it’s not a privacy compromise if analysts would have learned the same thing had the subject been replaced by another random member of the population. This is the idea behind the general solution concept of differential privacy, which is robust to the networked world and reveals fundamental truths about computing stability. “We’re learning about people, but not specifically learning about the individuals in the data set,” Dwork said.

At the beginning of her talk, Dwork addressed what she called “the statistics masquerade,” in which seemingly innocent questions reveal private truths about individuals. For example, asking (1) how many members of the House of Representatives have sickle cell trait and (2) how many members of the House, other than the speaker, have the trait compromises the privacy of the speaker because his status is inadvertently revealed. Photo from Cynthia Dwork, presentation at AN16.

According to Dwork, the so-called English-language definition of differential privacy is that “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the data.” Differential privacy is thus a strong privacy guarantee that often permits highly accurate data analysis. Dwork described basic algorithmic techniques for achieving differential privacy, emphasizing that the stability of an algorithm is necessary to prevent overfitting. The structure of a differentially private algorithm allows researchers to minimize cumulative privacy loss. This property sets differential privacy apart from other privacy-preserving techniques and helps prevent false discoveries that arise from adaptivity in data analysis.
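To make the idea concrete, here is a minimal Python sketch of one standard technique, the Laplace mechanism, applied to the two House of Representatives questions from the talk. The article does not say which algorithms Dwork presented, so this is an illustration rather than her method. The data, the 8% trait rate, the value of epsilon, and the helper names (laplace_noise, exact_count, private_count) are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def exact_count(records):
    # Number of True entries, released with no protection.
    return sum(records)

def private_count(records, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so Laplace noise of scale 1/epsilon gives epsilon-differential privacy.
    return sum(records) + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)

# Hypothetical data: True means the member carries the trait.
# Index 0 plays the role of the speaker.
house = [rng.random() < 0.08 for _ in range(435)]
all_but_speaker = house[1:]

# With exact answers, subtracting the two counts reveals the speaker's status.
exact_diff = exact_count(house) - exact_count(all_but_speaker)

# With noisy answers, the difference is dominated by noise.
epsilon = 0.5  # assumed privacy parameter for each query
noisy_diff = (private_count(house, epsilon, rng)
              - private_count(all_but_speaker, epsilon, rng))

print("exact difference:", exact_diff)
print("noisy difference:", round(noisy_diff, 2))
```

In this sketch each noisy answer costs epsilon, and by basic composition the two queries together cost at most 2 * epsilon. This additive accounting is one simple way to see how cumulative privacy loss can be measured across analyses.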

Dwork also analyzed the problem of statistical validity in adaptive scenarios, where new questions and new studies depend on the results and outcomes of previous studies. There is a disconnect between theoretical results and data analysis practice: in practice, data is shared and reused, and hypotheses and new analyses are generated on the basis of data exploration and the conclusions of previous analyses. Dwork described work on guaranteeing the validity of statistical inference in adaptive data analysis, since most datasets are representative of populations as a whole; the word “plenty” in her talk’s title refers to one giant, collective data set used by all researchers. “Science is by nature an adaptive process,” she said. “Everyone is studying the same data set and they all influence each other. If in the process of adaptive exploration the analyst finds a query for which the dataset is not representative, then she must have learned something significant about the data.”

Lina Sorg is the associate editor of SIAM News.