Several years ago, as the U.S. Census Bureau began to prepare for its 2020 count, it was confronted with an existential problem.
A growing body of academic research was providing evidence that machine learning systems, combined with the availability of large commercial datasets about Americans, were making it possible to personally identify people from information in confidential datasets—like the Census.
The bureau, which relies on Americans willingly sharing their private information under the assurance they won’t be personally identifiable, decided to conduct its own test. In 2016, it found that by combining a relatively small fraction of the statistics it published after the 2010 Census with commercial datasets available at the time, anyone could undermine the Census’s current privacy system and reconstruct the name, location, and key demographic characteristics of about 52 million people.
“If they’d used more statistics, it could have been worse. If they’d used more rich commercial datasets, it could have been worse,” said Cynthia Dwork, a computer science professor at Harvard University.
In 2017, that discovery kicked off one of the most consequential changes in the Census Bureau’s history: a privacy overhaul designed to prevent even the most advanced snoops from using the hundreds of billions of statistics the bureau publishes every 10 years to link confidential data back to individuals and exploit their highly personal information.
But that shift—now in its final stages—isn’t going quite as planned.
The issue: The census has shifted to “differential privacy,” a method of measuring how much “noise” (basically fake information) to add to the data in order to minimally impact the quality of the dataset while mathematically ensuring that individuals can’t be identified. But many states and civil rights groups argue that this approach has compromised the quality of the census data, rendering it unusable, and that the changes are disproportionately affecting minority groups.
On March 10, Alabama sued the bureau to prevent it from implementing differential privacy. That suit, filed in the U.S. District Court for the Middle District of Alabama, Eastern Division, has the support of 16 other states. Civil rights groups have raised the alarm as well, worried that the change will dilute minority voting blocs during states’ upcoming redistricting processes and make it harder to follow the Voting Rights Act.
The Census Bureau did not respond to multiple requests for comment. In its filings in the Alabama case, the bureau has argued that sticking to the same privacy protocol it used in the 2010 Census would violate its legal requirement to protect survey respondents’ confidentiality due to subsequent advances in machine learning and big data, and that differential privacy is the only available method that fulfills that obligation while still allowing the bureau to publish a wide array of statistics.
But the bureau, which is racing to get data to the states so they can start their redistricting processes, has its defenders as well.
Many independent privacy and cryptography experts are adamant that differential privacy is necessary to save the Census.
“Differential privacy is essential to ensuring the accuracy of future Census surveys,” said John Davisson, senior counsel for the Electronic Privacy Information Center. “If you fail to protect the privacy of the Census survey today, you’ll get lower responses tomorrow because people won’t trust the confidentiality of their data.”
Here’s a rundown of the issue.
What Exactly Are the Privacy Changes?
The Census Bureau is required by law to ensure that the data it publishes cannot be used to identify individual respondents. It’s done this in a variety of ways over the decades, but beginning with the 1990 Census the bureau began introducing noise through a variety of modern techniques.
The Census did things like swapping characteristics of households in rural areas, where the number of households is small, making it easy to guess which information belonged to whom. It also assigned statistically likely characteristics to a small number of non-respondent or easily re-identifiable addresses—less than half a percent of the addresses counted during the 2010 Census.
As a result, the modern Census has never been a simple tabulation of survey responses. And accuracy decreases as you move from larger questions where the numbers will be precise, like the population of an entire state or the number of White men who are employed, to smaller groups where the bureau is forced to swap and impute, like the characteristics of a rural township or the number of male Native Alaskans in a particular district who are in a same-sex relationship.
But the bureau’s test—in which it was able to use de-identified Census statistics and commercial datasets to match more than a third of the population, by name, to the supposedly confidential information they shared on Census surveys—raised alarms.
As a result, the bureau chose to implement differential privacy, a relatively new technique for guarding against re-identification.
It’s a mathematical way of measuring how well the algorithms used to add noise to data fulfill their purpose: ensuring both statistical accuracy and confidentiality. And unlike previous techniques the bureau has used, differential privacy provides end users of Census data with the information necessary to calculate the margins of error created by the added noise.
But the level of statistical accuracy is the key question. Each time a true statistic about the database is revealed, it reduces by some small percentage the overall privacy of the original data. And each time noise is introduced, accuracy goes down. So when implementing differential privacy, the database owner must decide on a “privacy budget”—where to land between 100 percent accuracy and 100 percent privacy.
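To make the trade-off concrete, here is a minimal Python sketch of the textbook Laplace mechanism, one common way to achieve differential privacy for a simple count. It is an illustration only, not the bureau’s production system: the function names, the choice of Laplace noise, and the epsilon values are all assumptions made for this example. Epsilon stands in for the privacy budget, where a smaller epsilon means more privacy and more noise.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    # Assumption: adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1. Smaller epsilon gives a larger noise scale.
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


def margin_of_error(epsilon, confidence=0.95, sensitivity=1.0):
    # Because the noise distribution is public, a data user can compute how far
    # a published count is likely to sit from the true one.
    scale = sensitivity / epsilon
    return -scale * math.log(1.0 - confidence)


def split_budget(total_epsilon, n_statistics):
    # Hypothetical even split: under basic sequential composition the epsilons
    # of separate releases add up, so each statistic gets a smaller share.
    return total_epsilon / n_statistics
```

Under these assumptions, a budget of epsilon = 1 puts 95 percent of noisy counts within about 3 of the truth, while epsilon = 0.1 widens that margin to about 30. Splitting a fixed budget across more published statistics shrinks each share, which is why every additional table costs accuracy elsewhere.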
So What’s the Problem?
Many state officials are alarmed that the data they’ll be using to draw congressional districts and allocate funding will not be 100 percent accurate or, at least, will potentially be less accurate than what they’re used to. And for the first time, the bureau will guarantee absolute accuracy for only three statistics: total population by state, the number of housing units in each Census block, and the number of group quarters facilities, such as college dormitories or nursing homes, by type in each block.
The Census also plans to “post-process” the data—cleaning up the differentially private statistics to ensure that there are no confusing numbers, like Census blocks with negative populations or fractions of people.
But the process can lead to some odd quirks.
For instance, when Washington State officials examined an early demonstration set containing 2010 data run through the new system, they found 401 Census blocks where the entire population was over 85 years old and 3,353 where the entire population was under 14. An Alabama analysis of the same dataset showed 13,000 blocks where there were children but no adults.
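A rough Python sketch can show how quirks like these arise. It is a simplified stand-in, not the bureau’s actual post-processing: it assumes Laplace noise, a naive clamp-and-round cleanup step, and hypothetical blocks that each truly hold three adults and one child. All names and numbers here are illustrative.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two exponential draws gives zero-centered Laplace noise.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def post_process(noisy_value):
    # Naive cleanup: no negative populations and no fractions of people.
    return max(0, round(noisy_value))


def simulate_blocks(n_blocks=10000, adults=3, children=1, epsilon=0.5, seed=0):
    # Count how many simulated blocks end up with children but no adults
    # after noise is added to each count and the result is cleaned up.
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    children_only = 0
    for _ in range(n_blocks):
        noisy_adults = post_process(adults + laplace_noise(scale, rng))
        noisy_children = post_process(children + laplace_noise(scale, rng))
        if noisy_adults == 0 and noisy_children > 0:
            children_only += 1
    return children_only
```

Because the true counts in a single block are so small, noise of only a few people is enough to erase every adult in some blocks while leaving a child behind. Raising epsilon in this sketch (less noise, less privacy) makes such blocks disappear.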
In its lawsuit against the Census Bureau, Alabama argues, with the backing of 16 other states that have submitted an amicus brief, that it will be impossible to adhere to the Voting Rights Act’s fair redistricting requirements if the data states are working with is inaccurate. The states point out that minority groups are the most likely to be moved around and have noise infused into their numbers because, due to their smaller sizes, individuals in those groups are more at risk for re-identification.
An analysis by the Mexican American Legal Defense and Educational Fund and Asian Americans Advancing Justice of demonstration data released in November found that the Census’s system was altering populations in a way that made communities appear more homogeneous along racial lines and increased the population in rural areas while reducing it in urban ones. Should that kind of data be used for redistricting, the groups wrote in their report, it may lead states to draw congressional districts that are supposed to be majority-minority (more than 50 percent minority), a requirement of the Voting Rights Act, but in fact are not.
“We currently have grave concerns,” the groups wrote. And while the Census Bureau has promised improvements, “there has been a dearth of transparency, clarity, and engagement with external stakeholders.”
So What Happens Now?
The Census Bureau says that the demonstration data does not represent the final product, and that it is implementing changes based on the feedback it receives from states and other stakeholders.
The agency has also defended its approach on the grounds that there is no other feasible way for it to publish Census statistics without violating its legal obligation to protect privacy. In its court filings in the Alabama case, the bureau said it performed an empirical analysis of its options and determined that differential privacy provided the best balance between accuracy and privacy.
“Traditional statistical disclosure limitation methods, like those used in the 2010 census, cannot defend against modern challenges posed by enormous cloud computing capacity and sophisticated software libraries,” John Abowd, the bureau’s chief scientist, wrote in a declaration to the court. And expanding upon practices like swapping to the degree necessary to ensure privacy would “render the resulting data unusable for most data users.”
Many experts believe the threats to the data are real: A group of 20 leading privacy and cryptography experts told the court in the Alabama case that they believe the risk of census data being deanonymized is extremely high.
Meanwhile, time is running out before the Census Bureau is scheduled to publish the state-level data necessary for redistricting. Agency officials recently released another set of demonstration data for testing and are accepting feedback until late May.
The bureau is currently scheduled to finalize its differential privacy system in early June and then release the data in late September—a deadline that was already pushed back due to the pandemic. For many states, that means they won’t have the data until after the redistricting deadlines established in their constitutions.
The outcome of the Alabama case could throw another wrench into the works. The two sides met for a preliminary hearing on May 3, but as of the date of publication, the three-judge panel hearing the case had not made any rulings on the merits or the request for an injunction.
Whatever the outcome, if the case is appealed, it will likely go directly to the U.S. Supreme Court.