Hello, friends,
In today’s world, anonymity is increasingly rare. Public places where we used to be anonymous—walking down the street, shopping in a store, attending a protest, browsing the internet—are now places where technology is often being deployed to identify us.
This is an especially pressing problem for the U.S. Census Bureau, which needs the participation of all residents for its vital goal of ensuring equal representation in Congress and uses the promise of privacy to encourage it. The bureau takes extreme measures to protect the data of participants, including requiring employees to swear lifetime oaths to not disclose raw data.
But the ever-increasing power of computers and the availability of big datasets of personal information mean that supposedly anonymous datasets can increasingly be re-identified.
In 2016, the Census Bureau tested whether it could re-identify information from its published tables by “applying modern optimization methods along with relatively modest computational resources.” The result was that the bureau was able to reconstruct the geography, sex, age, and ethnicity of 46 percent of the U.S. population. It then linked those records with commercial databases and was able to re-identify more than 52 million people.
This was a wake-up call that led the bureau to turn to a new approach, called “differential privacy,” that it hoped would better protect the public. As Todd Feathers reports this week in The Markup, the changes are controversial—with many states and civil rights groups arguing that the quality of the Census has been compromised and that the changes are disproportionately affecting minority groups.
What is differential privacy and how does it work? I spoke this week with one of its inventors, Cynthia Dwork. She is the Gordon McKay Professor of Computer Science at the Harvard Paulson School of Engineering, the Radcliffe Alumnae Professor at the Radcliffe Institute for Advanced Study, and a Distinguished Scientist at Microsoft Research.
The interview has been edited for brevity and clarity.
Angwin: Let’s start with the most basic question: What is differential privacy?
Dwork: The English-language definition of differential privacy is that the outcome of any analysis that’s done in a differentially private way is essentially the same, independent of whether any individual opts in or opts out of the dataset. So the presence or absence of a small number of people can’t change the conclusions that one draws from the statistical analysis of the data, even if those people are outliers.
Suppose you have a medical dataset, and you study this dataset and you learn as a result of the study that smoking causes cancer. Some people could be harmed by others finding out this basic fact about their life.
A smoker who is publicly known to smoke may have their insurance premiums go up as a result. We want to be able to learn the basic truth that smoking causes cancer whether any particular individual is in the dataset or not. Differential privacy ensures this is the case.
Differential privacy disentangles harms that can arise from the statistical teachings of the dataset about the population as a whole from the harms that could come to an individual by choosing to join or not join the dataset.
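To make that definition concrete, here is a minimal sketch, in Python, of the Laplace mechanism, one standard way to achieve differential privacy. It is not described in the interview, and the dataset, function name, and epsilon value below are illustrative assumptions: the point is only that a noisy count computed from a dataset that includes you looks statistically almost the same as one computed from a dataset that doesn’t.

```python
import numpy as np

# A minimal sketch of the Laplace mechanism, one standard way to achieve
# differential privacy. The epsilon value and the made-up dataset are
# illustrative assumptions, not anything from the interview.

def private_count(data, epsilon=0.5):
    """Return a differentially private count; noise scale = sensitivity / epsilon."""
    true_count = sum(data)   # each record is 1 (smoker) or 0 (non-smoker)
    sensitivity = 1          # adding or removing one person changes the count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
population = list(rng.integers(0, 2, size=1000))  # hypothetical dataset
with_me = population + [1]      # the dataset if I opt in (and happen to smoke)
without_me = population         # the dataset if I opt out

# Because the noise is scaled to how much one person can change the count,
# the two noisy outputs are statistically almost indistinguishable.
print(private_count(with_me))
print(private_count(without_me))
```

The smaller the epsilon, the closer the two output distributions are and the stronger the privacy guarantee, at the cost of a noisier count.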
Angwin: Was the motivation for developing differential privacy to protect people and give them confidence to be part of bigger datasets?
Dwork: That’s exactly right. The motivation was strongly to protect people and to encourage them to allow their data to be used for important discoveries without worrying about personal harms that could arise as a result of participating.
And one of the things that we felt very, very strongly about was that it was important to protect outliers because outliers are often those most in need of protection.
Angwin: You have been described as inventing differential privacy. How did that come to be?
Dwork: I became very interested in privacy because of conversations with the philosopher Helen Nissenbaum. She was asking: What is the meaning of privacy in public when there are video surveillance cameras everywhere? This was before she came up with her really beautiful work on contextual integrity.
I had done a lot of work in cryptography, so various aspects of privacy were quite familiar to me, but I wanted to find a piece of the sociological privacy question that I could really sink my teeth into and where math could maybe do something.
So I settled on the question of privacy preserving statistical analysis. How do you figure out statistics about a population in a way that really preserves—and provably preserves—the privacy of everybody in the population?
I came in contact with Kobbi Nissim, who, together with Irit Dinur, had also been thinking about privacy in this setting and had some very negative results—which you’ve probably heard of as the database reconstruction theorem. They showed, roughly speaking, that overly accurate answers to too many statistical queries [of a database] could completely destroy privacy. And they thought that this was just the death knell for any kind of privacy preserving data analysis.
Now, one of the things that you think about as a computer scientist is how much computation do you need to carry out certain tasks?
I had been thinking about privacy on the scale of the Hotmail user database, which had about 500 million users. I realized no one could carry out 500 million queries without being detected and kicked off the system.
Kobbi had come to visit me, and I said, “What happens if we cut off the questioning early and don’t allow too many questions to be answered?” And that was the beginning of what eventually became differential privacy. Overall, there are four co-inventors of differential privacy.
Angwin: You, Nissim, Frank McSherry, and Adam Smith wrote the seminal paper on differential privacy in 2006. At that time, privacy in large datasets seemed like something of an abstract problem as opposed to a real-life threat, whereas now it is a common concern.
Dwork: I thought about it as a problem that was already real, one that people didn’t yet recognize but eventually would.
One place where it comes up a lot now is in machine learning, for instance in learning to recognize speech or to suggest text while you’re typing. Those suggestions that are being made to you come from analysis of other people’s data.
So differential privacy has become important in industrial work. If you have an iPhone, it’s there. It’s in every device that Apple sells. It’s used heavily in the Chrome browser. It is being used by Microsoft for telemetry in Windows.
Angwin: The Census Bureau’s adoption of differential privacy is getting a lot of attention these days. As we recently reported in The Markup, the bureau has struggled to implement it in a way that still produces the results that people want from Census data. How easy will it be for them to fix it?
Dwork: The Census decided to use differential privacy when they realized that the attacks that Dinur and Nissim had come up with in 2003 could be launched against Census publications.
As I mentioned, overly accurate estimates of too many statistics completely destroy privacy. And the Census publishes billions of statistics on the data of 308 million people. So they definitely fall into the “too many overly accurate” category.
So they bit the bullet and tried to find an implementation that would give them sufficient statistical accuracy and enough privacy. The bureau came up with an algorithm that it calls TopDown.
This is a good time to emphasize that differential privacy is not a specific algorithm. There are many ways of carrying out a task in a differentially private fashion, and each may have different privacy and accuracy trade-offs.
The Census applies differential privacy first, and then they do something called post-processing, in which they take the differentially private outputs—some of which may be negative—and force all of the outcomes to be non-negative.
Consider this example: Suppose we want to release the statistic that 500 people live somewhere. With differential privacy, noise will be added to that statistic. The noise is equally likely to be positive or negative, and it’s chosen randomly.
An analogy for that would be, suppose you have a perfectly fair coin and you flip this coin a thousand times. The expected number of heads is 500, but on any given trial, when you flip it a thousand times, you’ll either get a bit more than 500 or a bit less than 500. But, on average, it is 500.
If every time you got something less than 500 you decided to report 500 anyway, that would be biased, because when you average your reports out you don’t get 500. That, in effect, is what TopDown’s post-processing does when it forces negative noisy counts up to zero.
That has bad consequences for accuracy: it introduces bias, in the statistical sense that the expected value of the count that is produced is not equal to the actual count.
Most of the error in TopDown is being introduced in the post-processing.
If I could change one thing about their method, I wish they would either abandon the post-processing entirely or additionally release the differentially private noisy counts themselves, before post-processing.
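Dwork’s coin-flip point can be seen in a few lines of code. This is only an illustration of the bias she describes, not the Census Bureau’s actual TopDown implementation; the tiny block size and the noise scale are assumptions chosen for the example.

```python
import numpy as np

# An illustration of the bias Dwork describes, not the Census Bureau's
# TopDown code. The small true count and the noise scale are assumptions.

rng = np.random.default_rng(42)
true_count = 2                                    # a tiny census block: 2 people
noise = rng.laplace(loc=0.0, scale=3.0, size=100_000)

noisy = true_count + noise                        # symmetric noise: unbiased on average
clamped = np.clip(noisy, 0, None)                 # post-processing: force counts to be non-negative

print(f"mean of noisy counts:   {noisy.mean():.2f}")    # close to 2.0
print(f"mean of clamped counts: {clamped.mean():.2f}")  # noticeably above 2.0: biased upward
```

The symmetric noise averages out to the true count; clamping the negative draws to zero is what pushes the average up.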
Angwin: Mathematically, could you protect privacy without adding noise to the data, but just by limiting the number of queries to the database?
Dwork: Not entirely. You would still need to add a little bit of noise to protect privacy, but the impact of the noise would be much smaller. A good way to think about the trade-offs is to compare them to X-rays.
Medical researchers understand that exposure to X-rays adds up and can eventually reach cancer-causing doses. So whenever there’s a question of taking an image, the doctor has to ask: Is it worth adding to your cumulative exposure for this particular image?
In a confidential dataset, whenever you produce a statistic, you’re experiencing a little bit of privacy-eroding radiation. The dose matters.
If you want to be very accurate, that’s like a high dosage. You’ve lost a lot of privacy. If you’re willing to live with a less accurate statistic, then it’s a lower dosage.
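The “dosage” analogy maps onto how the noise is typically calibrated. The sketch below assumes the Laplace mechanism and basic composition, which are standard techniques but are my illustration rather than anything specified in the interview: a smaller epsilon means less privacy loss per query but noisier answers, and the epsilons of repeated queries add up.

```python
import numpy as np

# A sketch of the "dosage" idea under the Laplace mechanism and basic
# composition (standard techniques; an illustration, not Dwork's or the
# Census Bureau's code). Smaller epsilon = less privacy loss per query
# but noisier answers; the epsilons of repeated queries add up.

def laplace_query(true_value, epsilon, sensitivity=1.0):
    """Answer one counting query with Laplace noise scaled to sensitivity / epsilon."""
    return true_value + np.random.laplace(0.0, sensitivity / epsilon)

budget_spent = 0.0
for epsilon in (0.1, 0.5, 1.0):                 # illustrative per-query budgets
    answer = laplace_query(500, epsilon)        # true statistic: 500 people
    budget_spent += epsilon                     # the cumulative privacy "dose"
    print(f"epsilon={epsilon:<4} noisy answer={answer:7.1f} total budget spent={budget_spent:.1f}")
```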
Angwin: Researchers who re-identify data from large datasets often do so by combining the data with commercially available datasets of personal information. If we lived in a world where commercial datasets about all of us weren’t available for pennies, would this be a different problem?
Dwork: One aspect of differential privacy is that it is future-proof, meaning that the privacy guarantees hold independently of additional information that becomes available later. It protects against future commercially available datasets.
But, it would probably be better if there weren’t wild amounts of data about us everywhere.
As always, thanks for reading.
Best,
Julia Angwin
Editor-in-Chief
The Markup