A basic understanding of ethics and privacy is essential to operate credibly in the data science world.
Does anybody here do data science? Many SIAM readers are being encouraged to stand up and be Spartacus. We are urged to create data science degrees, to apply for data science research grants, and to hold data science conferences. The field clearly opens up new and worthwhile employment opportunities for mathematically literate graduates. It also offers significant research challenges, especially at the interface between applied mathematics and applied statistics.
Although data science is necessarily outward-facing and collaborative (the data must come from some application field), we could argue that the mathematical sciences lie at the heart of the field. Hence, many of the fundamentals of data science---models, algorithms, and visualization tools---are already in our comfort zones. However, to operate credibly in the data science world, we also need at least a basic understanding of ethics and privacy. We should be able to articulate and defend our views on these matters, to point out which issues are particularly relevant for mathematicians, and to be aware of current regulations and best practices.
If we are preparing students for data science careers, they must be ready when potential employers, or clients, ask questions like “What’s the provenance of this data?”, “What steps have been taken to protect privacy?”, and “How do we explain these experiments in a way that keeps our customers on board?” (My own department runs a Research & Scholarship graduate module, in which a popular option is for students to write an essay on ethical considerations concerning their specific research topic.) Similarly, we need coherent privacy strategies on data collection and use in our research grant proposals, and, from personal experience, we can expect to be grilled on these issues if the funding decision depends on face-to-face interviews.
For these reasons, I am grateful to my colleague Peter Grindrod (University of Oxford), who has drawn on his considerable experience not only in developing tools for data-rich industries, advising research funders, and working with governments and utility industries, but also in developing algorithms and proving theorems. His article “Beyond Privacy and Exposure: Ethical Issues within Citizen-Facing Analytics” in Philosophical Transactions of the Royal Society A sets out some views on ethical codes, standards, and practices in citizen-facing analytics from the perspective of an applied mathematician in a fast-moving digital world. It is a valuable, example-laden think piece that distinguishes between the worlds of academics, industry researchers, business leaders, policymakers, and citizens. It also introduced me to the word “prosumer”. (A Google search corrected my initial guess that a prosumer must be somebody who assumes they are a professional!)
A key message from the article is that “the governing forces for analytics, and especially analytics concerning citizens' behaviours and transactions, depend on which of three spheres of operation an institution is in: corporate, public sector/government, or academic. Confusion arises when institutions switch spheres or when ethical codes, standards and practices developed for one sphere are applied to another.” For example, in the high-profile publication by Kramer et al., data collection procedures that appear to satisfy Facebook’s terms and conditions nonetheless attracted widespread condemnation when the governing forces were those of responsible academic research, notably around obtaining permission from users and approval from an institutional review board. Quoting from the resulting Editorial Expression of Concern: “It is nevertheless a matter of concern that the collection of the data by Facebook may have involved practices that were not fully consistent with the principles of obtaining informed consent and allowing participants to opt out.”
As a mathematical scientist who works with data, my default position used to be “I get to do the creative math stuff and I’m not qualified to speak about ethics or privacy,” but I have come around to the view that “I’m lucky enough to do the creative math stuff and I need to appreciate how this might impact ethics and privacy”.
On a related note, having recently attended the SIAM Annual Meeting and the SIAM Workshop on Network Science in Boston, I was struck by the fact that privacy considerations can motivate challenging new research topics in mathematics. And we can put the spotlight on either the data or the algorithm.
From the algorithm perspective, Cynthia Dwork, a distinguished scientist at Microsoft Research, opened the Annual Meeting with a talk on Differential Privacy. If we have a database containing personal information about a group of people, then, loosely, a differentially private algorithm is one for which adding or removing any one person’s record makes little difference to the distribution of the output. This makes it hard for adversaries to de-anonymize the data. The formal definition of epsilon-differential privacy looks like a sort of discrete global Lipschitz condition, and there are key challenges around designing and verifying such algorithms, as well as analyzing the trade-off between privacy and utility.
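To make the idea concrete, here is a minimal sketch (my own illustration, not from Dwork’s talk) of the classic Laplace mechanism applied to a counting query. A counting query has sensitivity one---adding or removing one person changes the count by at most one---so adding Laplace noise with scale 1/epsilon yields an epsilon-differentially private answer. All names and parameters below are mine:

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1: removing any one record changes
    the true count by at most 1.  Adding Laplace(0, 1/epsilon) noise then
    guarantees epsilon-differential privacy: for any two databases
    differing in one record, the probability of any output changes by at
    most a factor of exp(epsilon).
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) noise by inverse-transform sampling.
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Clamp the log argument to dodge the measure-zero case u = -0.5.
    noise = -scale * math.copysign(1.0, u) * math.log(max(1e-300, 1.0 - 2.0 * abs(u)))
    return true_count + noise

# Example: a private count of people aged 45 or over in a toy database.
ages = [30, 45, 50, 62]
print(dp_count(ages, lambda a: a >= 45, epsilon=0.5))
```

Smaller epsilon means more noise and hence more privacy but less utility, which is exactly the trade-off mentioned above.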
At the network science workshop, Alex Pothen from Purdue University looked at a technique in which the data is processed before any algorithm sees it. Suppose the data that we hold about each individual takes the form of positive integers. We may want to “mask” some entries---replacing them with a *---so that each individual i becomes indistinguishable from at least k(i) others, for some prescribed function k(i). If there were three categories (weight, height, religion) represented by integers, and my record were [1 3 7] while yours were [1 5 7], then perturbing both to [1 * 7] would make the two of us indistinguishable. How, then, do we achieve a target level of obfuscation in a large database with the minimum number of masks? Pothen showed how the problem can be couched in terms of graph matching and edge covers, and discussed new techniques that can be implemented efficiently on high-performance computers.
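For the simplest case, where every individual must be indistinguishable from at least one other, one can picture a graph whose edge weights count the entries on which two records disagree; pairing everyone at minimum total weight then minimizes the number of masks. The following toy sketch (my own, using brute force rather than Pothen’s efficient matching algorithms) illustrates the idea:

```python
from itertools import permutations

def mask_cost(a, b):
    # Number of entries that must be masked to make records a and b identical.
    return sum(x != y for x, y in zip(a, b))

def mask_pair(a, b):
    # Replace every disagreeing entry with '*', so the pair is indistinguishable.
    return tuple(x if x == y else '*' for x, y in zip(a, b))

def min_mask_matching(records):
    """Brute-force minimum-cost perfect matching over an even number of
    records: pair everyone so the total number of masks is smallest.
    (In practice this exhaustive search is replaced by efficient
    graph-matching and edge-cover algorithms.)"""
    n = len(records)
    assert n % 2 == 0, "toy version assumes an even number of records"
    best_cost, best_pairs = None, None
    for perm in permutations(range(n)):
        # Read the permutation as consecutive pairs; skip mirrored duplicates.
        pairs = [(perm[i], perm[i + 1]) for i in range(0, n, 2)]
        if any(i > j for i, j in pairs):
            continue
        cost = sum(mask_cost(records[i], records[j]) for i, j in pairs)
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, pairs
    return best_cost, best_pairs

# The example from the text: [1 3 7] and [1 5 7] merge into [1 * 7].
print(mask_pair((1, 3, 7), (1, 5, 7)))   # -> (1, '*', 7)
```

Even this tiny version shows why the problem is combinatorial: the number of possible pairings grows factorially, which is what makes the graph-matching formulation and its fast implementations valuable.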
In summary, when it comes to ethics and privacy, we should all be able to contribute to the debate, and some of us can contribute to the science.
Des Higham is a numerical analyst at the University of Strathclyde in Glasgow. He has research interests in stochastic computation, network science and city analytics. He is a SIAM Dahlquist Prize winner, a SIAM Fellow and a Fellow of the Royal Society of Edinburgh.