Hello, friends,
Back in 2000, a dot-com startup called DoubleClick came up with a bright idea: It was going to combine the information it gleaned from tracking users across the web with the names and email addresses of people whose information it had purchased from mailing lists.
The public outrage was swift and furious. The Federal Trade Commission and the New York State attorney general each launched investigations. The state of Michigan filed a lawsuit. Business Week compared DoubleClick’s move to George Orwell’s surveillance state in an editorial titled “1984 in 2000.”
After DoubleClick’s stock plunged, the company backed away from its plan. It hired a chief privacy officer, pledged to do privacy audits, and promised the FTC that it would not merge its so-called “clickstream” data with personally identifiable information.
“We’re making the commitment that until there is agreement between government and industry on privacy standards, we will not link personally identifiable information with anonymous Web behavior,” the CEO at the time, Kevin O’Connor, told The Wall Street Journal.
Since then, the hundreds of companies in the online advertising industry have generally steered clear of being too aggressive in linking online web browsing behavior with actual names. Most online ad tracking companies just know which websites you’ve been to—based on “cookies” they place on your computer—but don’t know your actual name.
Not Facebook, though. When Facebook began placing “Like” buttons and other widgets on sites beyond Facebook, it was the first massive online advertising company that knew the name of nearly every user who stumbled across one of its widgets. At first, Facebook said it was not collecting data from users across the web, but it eventually admitted it was. And it was even tracking users if they merely visited webpages—the user didn’t have to click on the Like button to send data to Facebook or even have a Facebook account.
Now, however, Facebook has replaced the Like button with tracking technology that’s even less visible. Meta, Facebook’s parent company, now uses technology it calls a “pixel”—a snippet of computer code that websites embed on their pages—that sends data back to Facebook. Unlike the Like button, it is invisible to users, and so most people are unaware that their data is being collected. What’s even less clear is what data Facebook is collecting on any given site. To get a better idea, The Markup partnered with Mozilla to conduct the first large-scale crowdsourced study of the presence of the pixel and the data it collects in real-world scenarios.
In the first investigation in our ongoing series, investigative data journalist Surya Mattu and reporter Colin Lecher found that the U.S. Department of Education’s online student financial aid website was sending all sorts of personal information about student loan applicants to Facebook—including first and last names, email addresses, and zip codes. After we contacted the agency, it stopped sharing data with Facebook, and a Facebook spokesperson said that the company tries to educate websites on how to use its tools correctly.
This week, two Republican members of Congress from North Carolina sent a letter to U.S. Secretary of Education Miguel Cardona citing The Markup’s investigation and stating that it was “completely inappropriate” for aid applicants to be tracked by Facebook through “predatory data collection.” The agency said that it is researching what happened.
To understand these sensitive data leaks, I turned this week to Gunes Acar, who has been investigating Facebook’s overcollection of user data for years. Acar is an assistant professor at the Digital Security group of Radboud University in the Netherlands. He researches online tracking mechanisms, web security, anonymous communications, and dark patterns.
Our conversation, edited for brevity and clarity, is below.
Angwin: Tell me about your history and how you ended up becoming an expert in data leakage across the internet.
Acar: When I was a Ph.D. student in electrical engineering, a mentor pointed out this project from EFF called Panopticlick that showed that you can track users by their unique browser characteristics—a practice called browser fingerprinting. I found that spooky and interesting at the same time. My first study focused on developing tools to detect and measure browser fingerprinting at scale. And then I continued this line of research on unconventional tracking methods.
Angwin: In 2015, you published a report that provided a very comprehensive look at how Facebook was tracking users across the web. Can you talk about your findings?
Acar: This study was part of a larger investigation into Facebook’s privacy practices by the Belgian Data Protection Authority (DPA), who requested the report. My colleagues, who were law faculty at the University of Leuven, and I were tasked with investigating how Facebook tracks users across the web through social plug-ins, and we focused specifically on nonusers—people without a Facebook account.
We examined how Facebook cookies and plug-ins such as the Like button can be used to track users across the web, even if they don’t have a Facebook account. One of the surprising findings was that on the European Digital Advertising Alliance website, where users could supposedly opt out of targeted advertising, Facebook would actually place a uniquely identifying cookie that can be used to track you around the web. Facebook claimed this was a bug, and it did end up fixing the issue. Regardless, the Belgian DPA’s investigation, which our report was part of, led to years of litigation between the DPA and Facebook.
Angwin: Let’s talk about the Facebook pixel. Can you tell us what it is?
Acar: The Facebook pixel is a piece of code you embed on your website in order to (re)target website visitors with ads on Facebook. It can also be used to measure conversions; for example, when someone sees your advertisement on Facebook and visits your website to sign up for a subscription or to buy a product, the pixel records this information.
If a website uses the Facebook pixel, then visits to the website get linked to the visitor’s Facebook account. With the pixel, the website can say to Facebook, “Hey, this user signed up for something” or “This user just added a product to their cart.” You can combine all these activities and apply them as labels to a visitor, and then you can target visitors (“audiences”) with ads on Facebook based on these labels.
Angwin: In our recent story, we found that websites using the pixel don’t necessarily know what data they’re sending back to Facebook. Is this normal, and can you put this finding in context for us?
Acar: The Facebook pixel is present on almost 30 percent of the top 100,000 websites, so it is quite common. It’s possible that some of these websites are not fully aware of what data the pixel collects about their visitors. Unlike other trackers, the pixel doesn’t just record that you visited a website; it collects more granular information about your interaction with the site. For example, the pixel can capture a user who searches for a product, types something into a form, adds something to their cart, and much more.
Given that the pixel doesn’t have a user interface—such as the Like button—I don’t think users expect that Facebook is actually collecting this granular data about their online activities, and when this information is merged with a Facebook account, it becomes a more complete profile, allowing Facebook to make more intrusive inferences about you. This is much worse from a privacy perspective, since data from the pixel can be tied to your real name, offline activities, and your social graph.
Angwin: Can you tell us about your latest finding on a Facebook pixel technique called automatic advanced matching?
Acar: Automatic advanced matching is a feature of the Facebook pixel that more accurately matches online visitors and their activities to Facebook users. When this feature is enabled, the pixel extracts and hashes personal data that’s entered into forms, such as an email address, phone number, name, date of birth, etc. Facebook then uses those (hashed) identifiers to link your Facebook profile to your website visits and activities.
The advantage of using this over cookies is that as a user you can remove or block cookies. Many browsers like Safari and Firefox now automatically block tracking-related cookies, and there are ongoing efforts to phase out third-party cookies. This means that identifiers based on email address or phone number—that are global, unique, and persistent—will likely become more important.
Facebook claims to collect this personal information from the website forms when the user clicks the submit button, but we found that it instead collects it when you click virtually any button or link on the page. For example, even if you just type in your email address and maybe you change your mind and decide to go back to another page or read the privacy policy before opening an account, once you click any link or any button, Facebook will extract your personal information, including email address, from the form, hash it, and send it to its servers.
We also found that TikTok was using a similar method to collect personal data typed into forms. TikTok has a product called TikTok Pixel, which also has a feature to automatically harvest form data. When you type in your email address or phone number on a form, clicking almost any button triggers data collection by TikTok. [TikTok and Meta did not respond to Wired’s request for comment when it reported these findings.]
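To illustrate the hashing step Acar describes for automatic advanced matching, here is a minimal Python sketch. It assumes the identifier is trimmed, lowercased, and hashed with SHA-256; the interview does not name the algorithm or the normalization, and the function names are hypothetical. The sketch shows why a hashed email address still works as a stable identifier: the same address always produces the same digest, so two parties that each hold the address can match on the hash.

```python
import hashlib


def normalize_email(raw: str) -> str:
    # Assumed normalization: strip surrounding whitespace and lowercase.
    return raw.strip().lower()


def hash_identifier(raw: str) -> str:
    # Assumed algorithm: SHA-256 over the UTF-8 bytes of the normalized value.
    normalized = normalize_email(raw)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The value a visitor types into a form (hypothetical example address).
typed_into_form = hash_identifier("  Jane.Doe@Example.com ")
# The value a platform could compute from the address on file for an account.
known_to_platform = hash_identifier("jane.doe@example.com")

# Both sides arrive at the same digest, so the visit can be linked to the account.
print(typed_into_form == known_to_platform)  # True
```

Hashing hides the raw address from anyone who does not already know it, but it does not stop matching by a party that does.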
Angwin: Facebook and TikTok are not the only services collecting information before you press submit. You also discovered that other websites extract email and password information from users. Can you describe what you found?
Acar: We investigated what personal data is exfiltrated on websites before a user submits a form or consents to sharing information. For example, when you are on a website that has third-party trackers, and say you fill out a log-in form, we want to know whether this personal information is exfiltrated to trackers.
We found that on almost 3,000 out of the top 100,000 websites, a user’s email address or its hash was sent to tracker domains before any information was submitted by the user. In addition, we found that on more than 50 websites, when users typed in their password, the password was incidentally collected by third-party trackers. One other surprising finding is that when we looked into different website categories, the one where we observed the most email leaks was fashion and beauty, followed by online shopping, followed by news websites. The category with no leaks at all was pornography!
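One way to picture this kind of measurement is to search captured network requests for a known test email address in several encodings. The Python sketch below is an illustration under stated assumptions, not the researchers’ actual tooling: the set of encodings, the function names, and the example URLs are all hypothetical.

```python
import base64
import hashlib
from urllib.parse import quote


def candidate_encodings(email: str) -> dict[str, str]:
    # Forms in which a typed email might appear in a request (assumed list).
    raw = email.strip().lower()
    data = raw.encode("utf-8")
    return {
        "plain": raw,
        "urlencoded": quote(raw, safe=""),
        "base64": base64.b64encode(data).decode("ascii"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def find_leaks(email: str, requests: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Each request is a (url, body) pair captured before the form was submitted.
    needles = candidate_encodings(email)
    leaks = []
    for url, body in requests:
        # Case-insensitive comparison, so uppercase hex digests also match.
        haystack = (url + " " + body).lower()
        for label, needle in needles.items():
            if needle.lower() in haystack:
                leaks.append((url, label))
    return leaks


test_email = "test.user@example.com"  # hypothetical test address
digest = hashlib.sha256(test_email.encode("utf-8")).hexdigest()
captured = [
    ("https://tracker.example/collect?em=" + digest, ""),
    ("https://shop.example/search", "q=shoes"),
]
for url, encoding in find_leaks(test_email, captured):
    print(encoding, "found in request to", url)
```

A real crawl would also need to attribute each request to a tracker domain and account for other encodings and compression, which this sketch leaves out.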
Angwin: What is being done to prevent this invasive type of tracking?
Acar: We are seeing a shift, and I think this could be a reason to be hopeful. Over the last three to four years, major browsers such as Firefox and Safari have introduced new privacy-preserving features that make tracking more difficult, especially across sites. We are also seeing privacy products like DuckDuckGo [which is a donor to The Markup] and Brave, where the distinguishing feature is privacy.
However, one of the reasons we wanted to study how email addresses and other identifiers can be used for tracking is that they don’t require cookies, so they could actually bypass all these efforts to make the web more private. What worries me is that more ways around these protections will be developed to track, profile, and target people.
As always, thanks for reading.
Best,
Julia Angwin
The Markup
Additional Hello World research by Eve Zelickson.