By Paul Davis
In a world awash with data—and plenty of merchants, politicians, and others seeking leverage from that data—we all worry about privacy, even if we lack a sharp definition for it. Perhaps, like pornography, we think we know it when we see it. Or when we lose it.
On the other hand, we give little thought to the validity of analyses based on those vast stores of data. Important policy decisions resting on such data—say, those involving fairness in granting home loans—come and go with little public comment.
In her invited talk at the 2016 SIAM Annual Meeting, held in Boston this July, Cynthia Dwork (Microsoft Research) addressed these and other issues by describing differential privacy, an intuitively appealing but rigorously defined concept now so widely implemented that its expert practitioners are increasingly in demand. The result of her work and that of her collaborators is a precise formulation of a socially useful notion of privacy.
Massive databases often serve important social ends: identifying links between smoking and cancer, detecting epidemics early from patterns of over-the-counter drug purchases, or extracting evidence of discrimination from piles of loan applications. Setting aside unauthorized access, the challenge of the last half-century has been to preserve the privacy of individuals whose records appear in such a database—their financial records, in the last example—while still permitting analysis of the full body of data for its potential social benefits.
Dwork, a distinguished scientist at Microsoft Research, began her presentation by discrediting some commonly trusted privacy strategies. For example, she stated flatly, “De-identification isn’t.” Anonymity offers no protection if some collective statistics drawn from the data change sufficiently when one individual is removed from the data set. A so-called differencing attack can identify a particular person as a target from one such change, then extract a hitherto unknown fact from another change.
Being a needle in a haystack does not help either. A hacker who holds just a bit of additional genetic information about a few individuals could first determine which of them were represented in a large genetic database, then whether each individual has any of the diseases recorded there. Dwork observed that the availability of “overly accurate estimates of too much data means no individual privacy.”
Some form of privacy protection is essential to entice individuals to risk participating in databases that provide socially useful findings. Data showing that smoking causes cancer, for example, could expose participants revealed as smokers to the risk of higher insurance premiums. How might such protections be formulated, then reliably implemented?
Dwork pointed to an idea advanced in 1977 by Tore Dalenius: privacy-preserving data analysis should reveal nothing about an individual that is not already known. Large-scale analysis of the data remains worthwhile, but such analytic outcomes need to be stable—equally likely—whether or not a given individual is in the population. This sort of stability preserves individual privacy while protecting against over-fitting. “Privacy and generalizability are aligned,” Dwork said.
Differential privacy is achieved by applying a randomized mechanism or algorithm to the data before its release; e.g., masking by adding carefully chosen noise. The mechanism provides \(\epsilon\)-differential privacy if, for any two data sets that differ in a single element, and for every possible outcome of the mechanism applied to those sets (say, available data entries with noise added), the ratio of the probabilities that the mechanism produces that outcome on the two sets is uniformly bounded above by \(\exp(\epsilon)\).
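In symbols, restating the definition just given: a randomized mechanism \(\mathcal{M}\) provides \(\epsilon\)-differential privacy if

\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S]
\]

for every set of possible outcomes \(S\) and every pair of data sets \(D\) and \(D'\) differing in a single element. For small \(\epsilon\), \(e^{\epsilon} \approx 1 + \epsilon\), so the mechanism's output is nearly equally likely whether or not any one individual's data is present.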
Socially, differential privacy offers a probabilistic promise to protect individuals from harm due to their choice to be in a database. Mathematically, removing one sample from the masked data can change the probability of any given outcome by at most a bounded factor; for highly confidential information, that factor can be made close to unity. For example, a health insurer should not be able to detect a change in a count of the smokers in a data set when a prospective customer is removed from it.
This added noise must be chosen with care. For instance, if fresh noise that is symmetric about the origin is drawn for each answer, an attacker can average it away by repeating a fixed query. So-called sensitive queries—those whose answers can vary widely between data sets that differ in only one element—require more noise to conceal differences than do relatively insensitive queries such as counting. A count of smokers, for example, changes by at most one with the removal of a single individual’s data.
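A brief simulation (illustrative only; the true count and noise level are hypothetical) shows why fresh symmetric noise per answer fails against repetition:

```python
import random

# Hypothetical true answer to a fixed counting query.
TRUE_COUNT = 42
rng = random.Random(0)

def noisy_count():
    """Answer the same query with fresh zero-mean (symmetric) noise each time."""
    return TRUE_COUNT + rng.gauss(0, 10)

# An attacker simply repeats the query and averages the answers;
# the symmetric noise cancels, exposing the true count.
n = 10_000
estimate = sum(noisy_count() for _ in range(n)) / n
```

With 10,000 repetitions, the standard deviation of the average shrinks by a factor of 100, so `estimate` lands essentially on the true count.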
Dwork and her colleagues have established that the ubiquitous statistical databases—those with vector-valued data—can provide \(\epsilon\)-differential privacy when the added noise is symmetric Laplace noise with standard deviation proportional to the sensitivity of the potential queries and inversely proportional to \(\epsilon\). In Dwork’s expressive phrasing, these databases “smush out the noise” to defend against more sensitive queries or to achieve tighter differential privacy. Differential privacy depends upon the queries and the level of protection, not upon the data itself.
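A minimal sketch of this Laplace mechanism for a counting query follows; the code is not from the talk, and all names and parameter values are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverse transform sampling."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release the answer plus noise of scale sensitivity/epsilon,
    which yields epsilon-differential privacy for this query."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(7)
smokers = 123  # hypothetical true count of smokers in the data set
# A count changes by at most 1 when one person is removed, so sensitivity = 1.
noisy_count = laplace_mechanism(smokers, sensitivity=1, epsilon=0.5, rng=rng)
```

Note how the noise scale grows as \(\epsilon\) shrinks (tighter privacy) or as sensitivity grows, matching the trade-off described above.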
One surprising application of these ideas is the reduction of the false discovery rate in adaptive queries of large statistical databases. Applying a differential privacy mechanism prevents queries from revealing anything distinctive about subsets of the underlying database. Since false discoveries are results about a data sample that are not characteristic of the database as a whole, imposing differential privacy sharply limits their occurrence.
Another application is the provision of reusable hold-out sets for learning. The standard paradigm is to learn (optimize) on a training set, then check against a hold-out set. If the hold-out set is protected by a differential privacy mechanism, then the testing queries reveal nothing about the hold-out set and it can be reused repeatedly.
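A highly simplified sketch of this idea (illustrative, not the exact algorithm of Dwork and her colleagues): answer every query to the hold-out set with a small amount of Laplace noise, so that repeated adaptive queries reveal little about any single hold-out record.

```python
import math
import random

class NoisyHoldout:
    """Hold-out set whose accuracy queries are answered with Laplace noise."""

    def __init__(self, holdout, epsilon_per_query, seed=0):
        self.holdout = holdout  # list of (example, label) pairs
        # The sensitivity of a mean over n records is 1/n.
        self.scale = 1.0 / (epsilon_per_query * len(holdout))
        self.rng = random.Random(seed)

    def _laplace(self):
        u = self.rng.random() - 0.5
        sign = 1.0 if u >= 0 else -1.0
        return -self.scale * sign * math.log(1.0 - 2.0 * abs(u))

    def accuracy(self, predict):
        """Noisy fraction of hold-out examples the model classifies correctly."""
        correct = sum(1 for x, y in self.holdout if predict(x) == y)
        return correct / len(self.holdout) + self._laplace()

# Hypothetical usage: a toy labeled hold-out set and a perfect classifier.
holdout = [(x, x % 2) for x in range(1000)]
ho = NoisyHoldout(holdout, epsilon_per_query=0.5)
acc = ho.accuracy(lambda x: x % 2)  # close to 1.0, perturbed by small noise
```

Because each answer is noisy, an analyst can query the hold-out set many times during model selection without silently over-fitting to it.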
The algorithmic details and the nuances needed for various important data settings are developing rapidly. Though the devil may be in those details, so are the jobs. Dwork reported that the lack of a competing theory of privacy-preserving data analysis means that those who know differential privacy can find jobs at Apple and the U.S. Census Bureau, among others. Indeed, the chief scientist of the Bureau “is a strong advocate for differential privacy,” a powerful endorsement of the socially valuable, mathematically rigorous discoveries she reported.
Paul Davis is professor emeritus of mathematical sciences at Worcester Polytechnic Institute.