For months now, The Markup reporters Jon Keegan and Alfred Ng have been investigating the little-known industry of data brokers that scoop up location information from apps on your phone and sell it.
In September, they identified 47 companies that buy, sell, trade, and aggregate location data. Few are household names, but their marketing materials make aggressive claims. To take just one example, a company called Near boasts that it provides “The World’s Largest Dataset of People’s Behavior in the Real-World” with data representing “1.6 billion users” in “44 countries.”
The location data traffickers are secretive about which apps they buy data from, but in December, Jon and Alfred were able to identify a major source for the industry: a family safety app called Life360 with more than 35 million users in 140 countries. Parents use the app to track their children’s movements in real time, but many apparently did not find the setting in the app to turn on the feature called “Do Not Sell My Information.”
Life360 CEO Chris Hulls told Alfred and Jon that Life360 was selling location data to “approximately one dozen data partners.” And while the company removed the most obvious identifying user information, it did not always make efforts to “fuzz,” “hash,” aggregate, or reduce the precision of the location data to preserve privacy—as per industry best practices.
In January, Life360 reversed course and said it would stop selling raw location data and end its relationships with most data brokers. It promised to instead sell aggregated data to just one firm, Placer.ai, and continue its raw-data-sharing partnership with one other, Allstate’s Arity.
“Life360 recognises that aggregated data analytics (for example, 150 people drove by the supermarket) is the wave of the future and that businesses will increasingly place a premium on data insights that do not rely on device-level or other individual user-level identifiers,” Hulls said in the announcement.
But how private is location information even when it is aggregated? To explain the risks of this particularly sensitive category of data, I turned this week to Yves-Alexandre de Montjoye, who has been studying the privacy risks of location data for more than a decade.
De Montjoye is an associate professor of applied math at Imperial College London, where he heads the Computational Privacy Group. He received his Ph.D. from MIT in 2015 and has advanced degrees in mathematics and mathematical engineering from Université catholique de Louvain, École Centrale des Arts et Manufactures, and Katholieke Universiteit Leuven.
He is a special adviser on AI and data protection to European Commission commissioner for justice Didier Reynders and a Parliament-appointed expert to the Belgian Data Protection Agency. He stated that his opinions are strictly his own and do not represent the views of any of the institutions he works for.
My interview with de Montjoye is below, edited for brevity and clarity.
Angwin: I wanted to start with your groundbreaking study from 2013 on location datasets. Can you tell me what sparked your interest in this topic?
de Montjoye: In 2010 I joined the Santa Fe Institute in New Mexico, which is a complex systems research institute. I was working with large datasets, and, at the same time, I was also really interested in the topic of privacy. I was reading studies that were raising privacy concerns, and I was fascinated that there was this constant rebuttal to the studies that you shouldn’t worry because the data is anonymous. This surprised me because I was working with location data and, seeing users moving around, my intuition was that there was a disconnect between claims that data was anonymous and what was probably possible to do in terms of re-identification.
This is where the idea for the Unique in the Crowd paper came from. We wanted to develop a statistical approach to quantify what it would take, on average, to identify someone from an anonymous location dataset, and we were able to show that, in a dataset of 1.5 million people, four approximate place-and-time data points were enough to uniquely identify a person 95 percent of the time.
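For readers who want to see the idea concretely, here is a minimal sketch of how such a "uniqueness" estimate could be computed. It is an illustration under stated assumptions, not the paper's code: the function name `unicity`, the toy `traces` data, and the representation of each point as a (place, hour) pair are all hypothetical.

```python
import random

def unicity(traces, num_points=4, trials=1000, seed=0):
    """Estimate how often num_points known points single out one user.

    traces maps a pseudonym to a set of (place, hour) points.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Only users with enough points can be sampled from.
    eligible = [u for u, pts in traces.items() if len(pts) >= num_points]
    if not eligible:
        return 0.0
    unique = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        # Pretend an attacker knows num_points random points about this user.
        known = set(rng.sample(sorted(traces[user]), num_points))
        # Count every trace in the dataset that contains all the known points.
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:
            unique += 1
    return unique / trials

# Toy data: three pseudonymous users with four points each.
traces = {
    "a": {("home", 8), ("office", 9), ("gym", 18), ("bar", 21)},
    "b": {("home", 8), ("office", 9), ("gym", 18), ("cinema", 21)},
    "c": {("park", 8), ("office", 9), ("cafe", 13), ("bar", 21)},
}
four_points = unicity(traces, num_points=4)
one_point = unicity(traces, num_points=1)
```

In this toy dataset, four points always single out one user, while a single point often does not, because places like the office are shared. A real dataset would be far larger and noisier, but the measurement works the same way.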
Angwin: One thing I took away from this study was that location data is a special category of data that is inherently sensitive. Is that what you took away?
de Montjoye: What I find fascinating with location is that I see location as a universal identifier. Data on where you are simultaneously exists in a large number of datasets. If I wanted to start reconciling identities across datasets, location would be how I would do it because it is very rich and exists in so many datasets that are collected over completely different modalities, from my phone to my credit card.
When we thought of data in the past, we used to think of Excel spreadsheets with thousands of people and a handful of columns. Now data means where you were every 10 minutes for over a year. In these kinds of high dimensional datasets, a few pieces of information are going to be sufficient to identify someone with high likelihood.
Angwin: The 2013 study stirred up some debates, specifically whether an attacker would or would not have access to those four data points about someone. How realistic do you think this is?
de Montjoye: Yes, some people thought that gathering four points was completely unrealistic. To that I say, “Have you been on social media at all?” I do not think it is going to be that hard to find four or six or eight points about someone.
I think we were hitting on a bit of an inconvenient truth. We were showing that de-identification techniques just didn’t really scale to the new world of big data that we are in.
We’ve seen a number of examples of this, such as the Catholic priest who was identified from Grindr data. There are a lot of applications on your phone collecting location data, and it is also available across modalities.
Angwin: It is now 2022. Have there been any new developments in re-identification?
de Montjoye: What I am interested in at the moment is the potential for what we call profiling attacks. The vast majority of re-identification attacks have been based on matching, based on me knowing where you were during a given time, on a specific Sunday at 4 p.m., and then matching what I know about you with pseudonyms in the data set.
In my opinion, the next frontier is harnessing machine learning to develop a model for how someone usually behaves and using this model to identify someone a month from now. This is something we started to do in collaboration with Michael Bronstein. We show that the way you communicate on WhatsApp, for example, is specific enough that we can learn the way you, as a person, communicate with other people without knowing who these people are, and use this to identify you even six months later.
Essentially, the way you behave is so specific (how quickly you answer messages, how many people you exchange messages with, and so on) that we can actually learn a profile from a period of time. What we are showing is that the way we behave is very stable over time. I think this will be the next type of re-identification attack.
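To make the idea of a profiling attack concrete, here is a deliberately simplified sketch. It swaps the machine learning model described above for a nearest-profile lookup over two hand-picked features (average reply delay and number of distinct contacts). The function names, features, and data are hypothetical and are not the method used in the research.

```python
import math

def profile(events):
    """Summarize messaging behavior as (mean reply delay in seconds, distinct contacts).

    events is a list of (contact_pseudonym, reply_delay_seconds) pairs.
    Hypothetical features chosen for illustration only.
    """
    delays = [delay for _, delay in events]
    contacts = {contact for contact, _ in events}
    return (sum(delays) / len(delays), float(len(contacts)))

def reidentify(known_profiles, later_events):
    """Return the known person whose earlier profile is closest to the later behavior."""
    target = profile(later_events)
    # Plain Euclidean distance; a real attack would normalize and learn the features.
    return min(known_profiles, key=lambda name: math.dist(known_profiles[name], target))

# Profiles learned during an earlier period.
known_profiles = {
    "alice": profile([("x1", 30), ("x2", 40), ("x1", 20)]),
    "bob": profile([("y1", 600), ("y2", 500), ("y3", 700), ("y4", 600)]),
}
# Months later, the contacts carry different pseudonyms, but the habits look familiar.
later_events = [("z1", 35), ("z2", 25)]
guess = reidentify(known_profiles, later_events)
```

Note that the later data shares no contact pseudonyms with the earlier data. The match rests entirely on behavior, which is the point de Montjoye is making.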
Angwin: You’ve been quoted saying, “Anonymity is not a property of a data set, but is a property of how you use it.” I’m curious what you mean by that?
de Montjoye: The scientist in me sees the potential of this data for studying really important questions. There has been amazing work by people at Harvard, for example, using mobility data to predict the spread of infectious diseases, so I see some of the benefits of using this data.
However, there is also this false but very convenient notion that you just need to take a dataset, modify it one way or the other, and then it’s anonymous forever. This is not true. We need to move to a system in which we honestly acknowledge that the data is pseudonymous, and it’s not going to be super difficult for someone who has access to the data to re-identify someone.
There are some pretty good solutions out there to mitigate these risks, but they’re not perfect. We need to start seeing how we can rely on these solutions while acknowledging the remaining risks, including the fact that the data still exists in pseudonymous format. We must combine hard technical solutions with access control mechanisms, logging mechanisms, and governance mechanisms for how this data is being used.
Angwin: What is your preferred model for mitigating privacy risks while allowing scientists to answer important research questions?
de Montjoye: The notion of using data anonymously has a lot of value if protections are applied properly and correctly. To me, this means there is an ethics committee on top of the strong technical guarantees. If this is the case, I think we can get a lot of good—from a scientific perspective—information out of this data while limiting the risks. I think then the question is how to do it best technically, and there is no silver bullet.
One option is what Google has been doing with COVID: putting out “on average” mobility, probably with differential privacy applied, but that limits what you can do with the data. Another option is allowing researchers to formulate their hypotheses on synthetic data, what’s known as test data or fake data. Once your scientific hypothesis is well developed, you can send that piece of code to a server where the data exists in a pseudonymized fashion, and then after being run, only the relevant anonymized and aggregate data is sent back to you. That’s how I would technically imagine the system to work, but then the hard question is the ethics on top of it: How do you build a system that will prevent abuse?
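As a rough illustration of what releasing aggregates with differential privacy can look like, here is a minimal sketch that adds Laplace noise to per-place visitor counts. It is not Google's method, which the interview does not describe. The function name, the `epsilon` default, and the data are assumptions, and the sketch ignores details a real deployment would need, such as limiting how many places one person can contribute to and handling places with no visits.

```python
import random
from collections import Counter

def noisy_visit_counts(visits, epsilon=1.0, seed=None):
    """Return per-place visitor counts with Laplace noise of scale 1/epsilon.

    visits is a list of (device_pseudonym, place) pairs.
    Hypothetical helper; a sketch of the Laplace mechanism only.
    """
    rng = random.Random(seed)
    # Count each device at most once per place, so one person
    # changes any single count by at most 1.
    counts = Counter(place for _, place in set(visits))
    noisy = {}
    for place, count in counts.items():
        # The difference of two exponential draws with rate epsilon
        # is a Laplace draw with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        noisy[place] = count + noise
    return noisy

visits = [
    ("d1", "supermarket"), ("d2", "supermarket"), ("d3", "supermarket"),
    ("d1", "supermarket"), ("d1", "school"),
]
released = noisy_visit_counts(visits, epsilon=1.0, seed=42)
```

A smaller `epsilon` means more noise and stronger protection, at the cost of less accurate counts. That tradeoff is the limitation de Montjoye mentions: the more the data is protected, the less you can do with it.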
We conduct our research in order to demonstrate what is possible. To me, it’s crucial to generate all the evidence so that we can have an honest, informed conversation and examine, as a society, how we want to balance the potential and the risks. We have to make sure we’re standing on solid ground from a scientific perspective so we can have an informed conversation.
As always, thanks for reading.
Additional Hello World research by Eve Zelickson.